Abstract

In domains like bioinformatics, information retrieval and social net- 
work analysis, one can find learning tasks where the goal consists of infer- 
ring a ranking of objects, conditioned on a particular target object. We 
present a general kernel framework for learning conditional rankings from 
various types of relational data, where rankings can be conditioned on 
unseen data objects. Conditional ranking from symmetric or reciprocal 
relations can in this framework be treated as two important special cases. 
Furthermore, we propose an efficient algorithm for conditional ranking by 
optimizing squared regression and ranking loss functions. Experiments 
on synthetic and real-world data illustrate that such an approach delivers
state-of-the-art performance in terms of predictive power and computa- 
tional complexity. Moreover, we also show empirically that incorporating 
relational domain knowledge can improve the generalization performance. 



1 Introduction 

Let us start with two introductory examples to explain the problem setting of 
conditional ranking. First, suppose that a number of persons are playing an 
online computer game. For many people it is always more fun to play against 
someone with similar skills, so players might be interested in receiving a ranking 
of other players, ranging from extremely difficult to beat to novice players with 
no experience at all. Unfortunately, pairwise strategies of players in many games 
- not only in computer games but also in board or sports games - tend to
exhibit a rock-paper-scissors type of relationship (Fisher, 2008), in the sense
that player A beats player B with high probability, who in turn beats player C
with high probability, while player A has a high chance of losing to the same
player C. Mathematically speaking, the relation between players is not
transitive, leading to a cyclic relationship and implying that no global
(consistent) ranking of skills exists. Yet, a conditional ranking can always be
obtained for a specific player (Pahikkala et al., 2010b).



As a second introductory example, let us consider the supervised inference 
of biological networks, like protein-protein interaction networks, where the goal 
usually consists of predicting new interactions from a set of highly-confident
interactions (Yamanishi et al., 2004). Similarly, one can also define a
conditional ranking task in such a context, as predicting a ranking of all
proteins in the network that are likely to interact with a given target protein
(Weston et al., 2004).
However, this conditional ranking task differs from the previous one because (a) 
rankings are computed from symmetric relations instead of reciprocal ones and 
(b) the values of the relations are here usually not continuous but discrete. 

Applications for conditional ranking tasks arise in many domains where
relational information between objects is observed, such as relations between
persons in preference modelling, social network analysis and game theory, links
between database objects, documents, websites, or images in information
retrieval (Geerts et al., 2004; Grangier and Bengio, 2008; Ng et al., 2011),
interactions between genes or proteins in bioinformatics,
etc. When approaching conditional ranking from a graph inference point of view,
the goal consists of returning a ranking of all nodes given a particular target 
node, in which the nodes provide information in terms of features and edges 
in terms of labels or relations. At least two properties of graphs play a key 
role in such a setting. First, the type of information stored in the edges
defines the learning task: binary-valued edge labels lead to bipartite ranking
tasks (Freund et al., 2003), ordinal-valued edge labels to multipartite or
layered ranking tasks (Fürnkranz et al., 2009) and continuous labels result
in rankings that are nothing more than total orders (when no ties occur).
Second, the relations
that are represented by the edges might have interesting properties, namely 
symmetry or reciprocity, for which conditional ranking can be interpreted dif- 
ferently. 

We present a kernel framework for conditional ranking, which covers all 
above situations. Unlike existing single-task or multi-task ranking algorithms, 
where the conditioning is respectively ignored or only happening for training 
objects, our approach also allows conditioning on new data objects that are
not known during the training phase. Thus, in light of Figure 1, which will be
explained below, the algorithm is not only able to predict conditional rankings 
for objects A to E, but also for objects F and G that do not participate in the 
training dataset. From this perspective, one can define four different learning 
settings in total: 

• Setting 1: predict a ranking of objects for a given conditioning object, 
where both the objects to be ranked and the conditioning object were 
contained in the training dataset (but not the ranking of the objects for 
that particular conditioning object). 

• Setting 2: given a new conditioning object unseen in the training phase, 
predict a ranking of the objects encountered in the training phase. 

• Setting 3: given a set of new objects unseen in the training phase, pre- 
dict rankings of those objects with respect to the conditioning objects 
encountered in the training phase. 

• Setting 4: predict a ranking of objects for a given conditioning object, 
where neither the conditioning object nor the objects to be ranked were 
observed during the training phase. 

These four settings cover as special cases different types of conventional machine 
learning problems. The framework that we propose in this article can be used 
for all four settings, in contrast to many existing methods. In this paper, we 
focus mainly on setting 4. 
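
The four settings differ only in whether the conditioning object and the object
to be ranked were seen during training. The following is a minimal sketch of
that case distinction; the function name `setting_of` and the node labels
(A to E for training, F and G as new objects, following the description of
Figure 1) are illustrative assumptions, not part of the proposed algorithms.

```python
def setting_of(conditioning, ranked, train_nodes):
    """Return the learning setting (1-4) for ranking `ranked` given `conditioning`.

    Hypothetical helper: it only checks membership in the training node set.
    """
    cond_seen = conditioning in train_nodes
    ranked_seen = ranked in train_nodes
    if cond_seen and ranked_seen:
        return 1  # both known; only the relation value itself is missing
    if not cond_seen and ranked_seen:
        return 2  # new conditioning object, known objects to be ranked
    if cond_seen and not ranked_seen:
        return 3  # known conditioning object, new objects to be ranked
    return 4      # both unseen during training


train_nodes = {"A", "B", "C", "D", "E"}   # assumed training nodes
settings = {
    ("A", "B"): setting_of("A", "B", train_nodes),
    ("F", "B"): setting_of("F", "B", train_nodes),
    ("A", "G"): setting_of("A", "G", train_nodes),
    ("F", "G"): setting_of("F", "G", train_nodes),
}
print(settings)
```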

Setting 1 corresponds to the imputation type of link prediction setting, where
missing relational values between known objects are predicted. In this setting
matrix factorization methods (see e.g. Srebro et al. (2005)) are often applied.
Many of the approaches are based solely on exploiting the available link
structure, though approaches incorporating feature information have also been
proposed (Menon and Elkan, 2010; Raymond and Kashima, 2010). Setting 2 may
be encountered in information retrieval problems where new searches are
performed against a fixed database of objects. Setting 3 can be modeled as a
multi-task ranking problem where the number of ranking tasks is fixed in
advance (Agarwal, 2006). Finally, setting 4 requires that the methods used are
able to generalize both over new conditioning objects and objects to be ranked.
Learning in setting 4 may be realized by using joint feature representations of 
conditioning objects and objects to be ranked. 

In its most general form the conditional ranking problem can be considered as
a special case of the listwise ranking problem, encountered especially in the
learning to rank for information retrieval literature (see e.g. Cao et al.,
2006; Yue et al., 2007; Xia et al., 2008; Qin et al., 2008; Chapelle and
Keerthi, 2010; Kersting and Xu, 2009; Xu et al., 2010; Airola et al., 2011b).
For example, in document retrieval one is supplied
both with query-objects and associated documents that are ranked according 
to how well they match the query. The aim is to learn a model that can gener- 
alize to new queries and documents, predicting rankings that capture well the 
relative degree to which each document matches the test query. Previous learn- 
ing approaches in this setting have typically been based on using hand-crafted 
low-dimensional joint feature representations of query-document pairs. In our 
graph-based terminology, this corresponds to having a given feature represen- 
tation for edges, possibly encoding prior knowledge about the ranking task. 

In contrast, we focus on a setting in which we are only given feature
representations of nodes, from which the feature representations of the edges
have to be constructed; that is, the learning must be performed without
access to prior information about the edges. This opens up many possibilities
for applications, since we are not restricted to the setting where explicit
feature representations of the edges are provided. In our experiments, we
present several examples of learning tasks for which our approach can be
efficiently used. In addition, the focus of our work is the special case where
both the conditioning objects and the objects to be ranked come from the same
domain (see e.g. Weston et al. (2004); Yang et al. (2009) for similar
settings). This allows us to
consider how to enforce relational properties such as symmetry and reciprocity, 
a subject not studied in previous ranking literature. 

The proposed framework is based on the Kronecker product kernel for gener- 
ating implicit joint feature representations of conditioning objects and the sets of 
objects to be ranked. This kernel has been proposed independently by a number 
of research groups for modeling pairwise inputs in different application domains 
(Basilico and Hofmann, 2004; Oyama and Manning, 2004; Ben-Hur and Noble,
2005). From a different perspective, it has been considered in structured
output prediction methods for defining joint feature representations of inputs
and outputs (Tsochantaridis et al., 2005; Weston et al., 2007). While the
usefulness
of Kronecker product kernels for pairwise learning has been clearly established, 
computational efficiency of the resulting algorithms remains a major challenge. 
Previously proposed methods require the explicit computation of the kernel 
matrix over the data object pairs, thereby introducing bottlenecks in terms of
processing and memory usage, even for modest dataset sizes. To overcome
this problem, one typically applies sampling strategies of the kernel matrix for
training. However, non-approximate methods can be implemented by taking
advantage of the Kronecker structure of the kernel matrix. This idea has
traditionally been used to solve certain linear regression problems (see e.g.
Van Loan (2000) and references therein). More recent, related applications to
link prediction tasks are due to Kashima et al. (2009) and Raymond and Kashima
(2010), which can be considered under setting 1.
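
To illustrate what exploiting the Kronecker structure can mean in practice, the
sketch below uses the standard identity (A ⊗ B) vec(X) = vec(B X A^T), with vec
stacking columns. It is only an illustration under assumed toy data (a random
linear node kernel matrix `K` and coefficient matrix `A`); it is not claimed to
be the exact short-cut used in the algorithms of this article.

```python
import numpy as np


def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.flatten(order="F")


rng = np.random.default_rng(0)
p = 30
G = rng.normal(size=(p, 3))
K = G @ G.T                    # hypothetical p x p node kernel matrix
A = rng.normal(size=(p, p))    # one coefficient per ordered pair of nodes

explicit = np.kron(K, K) @ vec(A)   # materializes a p^2 x p^2 matrix
shortcut = vec(K @ A @ K.T)         # never forms the Kronecker product

print(np.allclose(explicit, shortcut))
```

The explicit product stores p^4 kernel entries, whereas the right-hand side
only needs two p x p matrix multiplications.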



We propose conditional ranking algorithms for setting 4, that is, for cases
where predictions are performed for (couples of) objects that are not observed 
in the training dataset. The algorithms take advantage of computational short- 
cuts based on the Kronecker structure of the kernel matrix. In addition, we 
introduce new computational short-cuts for incorporating the prior knowledge 
about symmetry and reciprocity properties of the underlying relations into the 
training algorithms. These results are, to our knowledge, completely novel in 
the field of both machine learning and matrix algebra. The computational 
efficiency enables the use of training data sets consisting of millions of labeled 
edges, whose feature space dimensionality may be even infinite in case of certain 
nonlinear kernel functions. 

We use a regression-based loss for estimating the edges or weights in a 
graph directly, as well as a ranking-based loss for optimizing the conditional 
ranking criterion more directly. We propose both update rules for iterative 
optimization algorithms, as well as closed-form solutions for certain special 
cases, exploiting the special structure of the Kronecker product to make
learning efficient. The proposed methods extend existing regularized
least-squares (RLS) (Saunders et al., 1998; Evgeniou et al., 2000) regression
algorithms and the RankRLS (Pahikkala et al., 2009) ranking algorithm to
conditional ranking tasks, while scaling to training graphs consisting of
millions of edges. Furthermore, we show how prior knowledge about the structure
of the underlying relation can be efficiently incorporated in the learning
process. Namely, we consider
the specific cases of symmetric and reciprocal relations. Thus, incorporation of 
domain knowledge, computational efficiency and the ability to condition on 
objects that do not appear in the training dataset yield three main points of 
interest for the algorithms developed in this article. 

The article is organized as follows. We start in Section 2 with a formal
description of conditional ranking from a graph-theoretic perspective. The
Kronecker product kernel is reviewed in Section 3 as a general edge kernel that
allows for modelling the most general type of relations. In addition, we briefly
recall two important subclasses of relations, namely symmetric and reciprocal
relations, for which more specific, knowledge-based kernels can be derived. The
proposed learning algorithms are presented in Section 4, and the connections
and differences with related learning algorithms are discussed in Section 5,
with a particular emphasis on the computational complexity of the algorithms.
In Section 6 we present promising experimental results on synthetic and
real-world data, illustrating the advantages of our approach in terms of
predictive power and computational scalability.

Figure 1: Left: example of a multi-graph representing the most general case
where no additional properties of relations are assumed. Right: examples of
eight different types of relations in a graph of cardinality three. The following
relational properties are illustrated: (C) crisp, (V) valued, (R) reciprocal, (S)
symmetric, (T) transitive and (I) intransitive. For the reciprocal relations, (I)
refers to a relation that does not satisfy weak stochastic transitivity, while (T)
shows an example of a relation fulfilling strong stochastic transitivity. For
the symmetric relations, (I) refers to a relation that does not satisfy
T-transitivity w.r.t. the Łukasiewicz t-norm $T_L(a, b) = \max(a + b - 1, 0)$,
while (T) shows an example of a relation that fulfills T-transitivity w.r.t. the
product t-norm $T_P(a, b) = ab$; see e.g. Luce and Suppes (1965); De Baets et
al. (2006) for formal definitions.



2 General Framework 

Let us start by introducing some notation. We consider ranking of data
structured as a graph $G = (V, E, Q)$, where $V \subseteq \mathcal{V}$
corresponds to the set of nodes, where nodes are sampled from a space
$\mathcal{V}$, and $E \subseteq V^2$ represents the set of edges $e$, for which
labels are provided in terms of relations. Moreover, these relations are
represented by weights $y_e$ on the edges and they are generated from an unknown
underlying relation $Q : \mathcal{V}^2 \to [0, 1]$. We remark that the interval
[0, 1] is used here only due to certain properties that are historically defined for 
such relations. However, relations ranging to arbitrary closed real intervals can 
be straightforwardly transformed to this interval with an appropriate increasing 
bijection and vice versa. 

Following the standard notation for kernel methods, we formulate our learning
problem as the selection of a suitable function $h \in \mathcal{H}$, with
$\mathcal{H}$ a certain hypothesis space, in particular a reproducing kernel
Hilbert space (RKHS). Given an input space $X$ and a kernel
$K : X \times X \to \mathbb{R}$, the RKHS associated with $K$ can be considered
as the completion of

$$\left\{ f \in \mathbb{R}^{X} \;\middle|\; f(x) = \sum_{i=1}^{m} \beta_i K(x, x_i) \right\}$$

in the norm

$$\|f\|_K = \sqrt{\sum_{i,j} \beta_i \beta_j K(x_i, x_j)} ,$$

where $\beta_i \in \mathbb{R}$, $m \in \mathbb{N}$, $x_i \in X$.
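
As a small numerical illustration of a finite kernel expansion and its norm,
the sketch below evaluates $f$ and $\|f\|_K = \sqrt{\beta^T G \beta}$, where $G$
is the Gram matrix of the expansion points. The Gaussian kernel, the points
`centers` and the coefficients `beta` are arbitrary assumptions made for the
example.

```python
import numpy as np


def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel, used here only as an example choice of K."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))


centers = np.array([[0.0], [1.0], [2.5]])   # x_1, ..., x_m (hypothetical)
beta = np.array([0.5, -1.0, 2.0])           # expansion coefficients


def f(x):
    """f(x) = sum_i beta_i K(x, x_i)."""
    return sum(b * gaussian_kernel(x, c) for b, c in zip(beta, centers))


gram = np.array([[gaussian_kernel(a, b) for b in centers] for a in centers])
norm_f = float(np.sqrt(beta @ gram @ beta))   # RKHS norm of f

print(f(np.array([0.3])), norm_f)
```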

Hypotheses $h : \mathcal{V}^2 \to \mathbb{R}$ are usually denoted as
$h(e) = \langle w, \Phi(e) \rangle$ with $w$ a vector of parameters that need
to be estimated based on training data. Let us denote a training dataset of
cardinality $q = |E|$ as a set $T = \{(e, y_e) \mid e \in E\}$ of input-label
pairs, then we formally consider the following variational problem in which we
select an appropriate hypothesis $h$ from $\mathcal{H}$ for training data $T$.
Namely, we consider an algorithm

$$A(T) = \operatorname*{argmin}_{h \in \mathcal{H}} \; \mathcal{L}(h, T) + \lambda \|h\|_{\mathcal{H}}^{2} \tag{1}$$

with $\mathcal{L}$ a given loss function and $\lambda > 0$ a regularization
parameter.



According to the representer theorem (Kimeldorf and Wahba, 1971), any
minimizer $h \in \mathcal{H}$ of (1) admits a dual representation of the
following form:

$$h(e) = \langle w, \Phi(e) \rangle = \sum_{\bar{e} \in E} a_{\bar{e}} K^{\Phi}(e, \bar{e}) , \tag{2}$$

with $a_{\bar{e}} \in \mathbb{R}$ dual parameters, $K^{\Phi}$ the kernel function
associated with the RKHS and $\Phi$ the feature mapping corresponding to
$K^{\Phi}$.

Given two relations $Q(v, v')$ and $Q(v, v'')$ defined on any triplet of nodes
in $\mathcal{V}$, we compose the ranking of $v'$ and $v''$ conditioned on $v$ as

$$v' \succeq_v v'' \;\Leftrightarrow\; Q(v, v') \geq Q(v, v'') . \tag{3}$$

Let the number of correctly ranked pairs for all nodes in the dataset serve as
evaluation criterion for verifying (3); then one aims to minimize the following
empirical loss when computing the loss over all conditional rankings
simultaneously:

$$\mathcal{L}(h, T) = \sum_{v \in V} \; \sum_{e, \bar{e} \in E_v : \, y_e < y_{\bar{e}}} I\big(h(e) - h(\bar{e})\big) , \tag{4}$$

with $I$ the Heaviside function returning one when its argument is strictly
positive, returning 1/2 when its argument is exactly zero and returning zero
otherwise. Importantly, $E_v$ denotes the set of all edges starting from, or the
set of all edges ending at the node $v$, depending on the specific task. For
example, concerning the relation "trust" in a social network, the former loss
would correspond to ranking the persons in the network who are trusted by a
specific person, while the latter loss corresponds to ranking the persons who
trust that person. So, taking Figure 1 into account, we would in such an
application respectively use the rankings $A \succ_C E \succ_C D$ (outgoing
edges) and $D \succ_C B$ (incoming edges) as training info for node $C$.
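
The empirical loss (4) can be computed directly by grouping edges by their
conditioning node. The sketch below does this for outgoing edges; the function
name, the edge labels and the predictions are made-up values chosen so that the
true ranking for node C is A, E, D as in the example above.

```python
from collections import defaultdict


def heaviside(x):
    """I(x): 1 if x > 0, 1/2 if x == 0, 0 otherwise."""
    if x > 0:
        return 1.0
    if x == 0:
        return 0.5
    return 0.0


def conditional_ranking_error(edges, labels, preds):
    """Loss (4), taking E_v as the edges that start from node v."""
    by_node = defaultdict(list)
    for idx, (v, _) in enumerate(edges):
        by_node[v].append(idx)
    loss = 0.0
    for idxs in by_node.values():
        for i in idxs:
            for j in idxs:
                if labels[i] < labels[j]:          # edge j should rank higher
                    loss += heaviside(preds[i] - preds[j])
    return loss


edges = [("C", "A"), ("C", "E"), ("C", "D")]
labels = [0.9, 0.6, 0.2]        # hypothetical relation values y_e
preds = [0.8, 0.3, 0.5]         # hypothetical predictions h(e): D and E swapped

loss = conditional_ranking_error(edges, labels, preds)
print(loss)
```

With these values exactly one pair (D versus E) is ordered incorrectly, so the
loss counts one error.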

Since (4) is neither convex nor differentiable, we look for an approximation
that has these properties, as this considerably simplifies the development of
efficient algorithms for solving the learning problem. To this end, let us
start by considering the following squared loss function over the $q$ observed
edges in the training set:

$$\mathcal{L}(h, T) = \sum_{e \in E} \big(y_e - h(e)\big)^{2} . \tag{5}$$

Such a setting would correspond to directly learning the labels on the edges in
a regression or classification setting. For the latter case, optimizing (5)
instead of the more conventional hinge loss has the advantage that the solution
can be found by simply solving a system of linear equations (Saunders et al.,
1998; Suykens et al., 2002; Shawe-Taylor and Cristianini, 2004; Pahikkala et
al., 2009). However, the simple squared loss might not be optimal in conditional
ranking tasks. Consider, for example, a node $v$ for which we aim to learn to
predict which of two other nodes, $v'$ or $v''$, would be closer to it. Let us
denote $e = (v, v')$ and $\bar{e} = (v, v'')$, and let $y_e$ and $y_{\bar{e}}$
denote the relation between $v$ and $v'$ and between $v$ and $v''$,
respectively. Then, it would be beneficial for the regression function to have
a minimal squared difference $(y_e - y_{\bar{e}} - h(e) + h(\bar{e}))^2$,
leading to the following loss function:

$$\mathcal{L}(h, T) = \sum_{v \in V} \; \sum_{e, \bar{e} \in E_v} \big(y_e - y_{\bar{e}} - h(e) + h(\bar{e})\big)^{2} , \tag{6}$$

which can be interpreted as a differentiable and convex approximation of (4).
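
One way to see the difference between (5) and (6) is that (6) only depends on
differences of residuals within each $E_v$, so adding a constant to all
predictions for one conditioning node leaves it unchanged, while (5) changes.
The sketch below checks this on made-up labels and predictions; the helper
names are assumptions for the example.

```python
import numpy as np


def squared_loss(y, h):
    """Regression loss (5): sum over edges of (y_e - h(e))^2."""
    return float(np.sum((y - h) ** 2))


def pairwise_squared_loss(groups, y, h):
    """Ranking loss (6): squared residual differences within each E_v."""
    total = 0.0
    for idxs in groups:
        resid = y[idxs] - h[idxs]                  # y_e - h(e)
        diff = resid[:, None] - resid[None, :]     # all ordered pairs (e, e_bar)
        total += float(np.sum(diff ** 2))
    return total


y = np.array([0.9, 0.6, 0.2])        # hypothetical labels for edges of one node
h = np.array([0.8, 0.3, 0.5])        # hypothetical predictions
groups = [np.array([0, 1, 2])]       # E_v for a single conditioning node

h_shifted = h + 0.25                 # same ranking, constant offset

print(squared_loss(y, h), squared_loss(y, h_shifted))
print(pairwise_squared_loss(groups, y, h), pairwise_squared_loss(groups, y, h_shifted))
```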



3 Relational Domain Knowledge 

Above, a framework was defined where kernel functions are constructed over the
edges, leading to kernels of the form $K^{\Phi}(e, \bar{e})$. In this section we
show how these kernels can be constructed using domain knowledge about the
underlying relations. The same discussion was put forward for inferring
relations. Details and formal proofs can be found in our previous work on this
topic (Pahikkala et al., 2010b; Waegeman et al., 2012).



3.1 Arbitrary Relations 

When no further restrictions on the underlying relation can be specified, then
the following Kronecker product feature mapping is used to express pairwise
interactions between features of nodes:

$$\Phi(e) = \Phi(v, v') = \phi(v) \otimes \phi(v') ,$$

where $\phi$ represents the feature mapping for individual nodes. As shown by
Ben-Hur and Noble (2005), such a pairwise feature mapping yields the
Kronecker product pairwise kernel in the dual model:

$$K^{\Phi}_{\otimes}(e, \bar{e}) = K^{\Phi}_{\otimes}(v, v', \bar{v}, \bar{v}') = K^{\phi}(v, \bar{v}) \, K^{\phi}(v', \bar{v}') , \tag{7}$$

with $K^{\phi}$ the kernel corresponding to $\phi$.
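
The equivalence between the explicit Kronecker feature map and the product of
node kernels in (7) can be checked numerically. The sketch below assumes a
linear node kernel with explicit random features `phi`; these features and the
helper names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))   # 4 hypothetical nodes with 3 features each


def node_kernel(a, b):
    """Linear node kernel K^phi(a, b) = <phi(a), phi(b)>."""
    return float(phi[a] @ phi[b])


def edge_features(v, v2):
    """Explicit joint feature map Phi(v, v') = phi(v) (x) phi(v')."""
    return np.kron(phi[v], phi[v2])


def kronecker_edge_kernel(e, e_bar):
    """Kernel (7): product of the two node kernel evaluations."""
    (v, v2), (w, w2) = e, e_bar
    return node_kernel(v, w) * node_kernel(v2, w2)


e, e_bar = (0, 1), (2, 3)
explicit = float(edge_features(*e) @ edge_features(*e_bar))
implicit = kronecker_edge_kernel(e, e_bar)
print(np.isclose(explicit, implicit))
```

The implicit form never builds the 9-dimensional joint feature vector, which is
what makes nonlinear (possibly infinite-dimensional) node kernels usable.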

It can be formally proven that with an appropriate choice of the node
kernel $K^{\phi}$, such as the Gaussian RBF kernel, the RKHS of the
corresponding Kronecker product edge kernel $K^{\Phi}_{\otimes}$ allows
approximating arbitrarily closely any relation that corresponds to a continuous
function from $\mathcal{V}^2$ to $\mathbb{R}$. Before summarizing this
important result, we recollect the definition of universal kernels.



Definition 3.1. (Steinwart, 2002) A continuous kernel $K$ on a compact metric
space $X$ (i.e. $X$ is closed and bounded) is called universal if the RKHS
induced by $K$ is dense in $C(X)$, where $C(X)$ is the space of all continuous
functions $f : X \to \mathbb{R}$.

Accordingly, the hypothesis space induced by the kernel $K$ can approximate
any function in $C(X)$ arbitrarily well, and hence it has the universal
approximating property.



Theorem 3.2. (Waegeman et al., 2012) Let us assume that the space of nodes
$\mathcal{V}$ is a compact metric space. If a continuous kernel $K^{\phi}$ is
universal on $\mathcal{V}$, then $K^{\Phi}_{\otimes}$ defines a universal kernel
on $\mathcal{V}^2$.



The proof is based on the so-called Stone-Weierstraß theorem (see e.g. Rudin
(1991)). The above result is interesting because it shows, given that an
appropriate loss is optimized and a universal kernel $K^{\phi}$ is applied on
the node level, that the Kronecker product pairwise kernel has the ability to
assure universal consistency, guaranteeing that the expected prediction error
converges to its lowest possible value when the amount of training data
approaches infinity. We refer to (Steinwart and Christmann, 2008) for a more
detailed discussion on the relationship between universal kernels and
consistency. As a consequence, the Kronecker product kernel can always be
considered a valid choice for learning relations if no specific a priori
information other than a kernel for the nodes is provided about the relation
that underlies the data. However, we would like to emphasize that Theorem 3.2
does not guarantee anything about the speed of the convergence or how large a
training set is required for approximating the function closely enough. As a
rule of thumb, whenever we have access to useful prior information about the
relation to be learned, it is beneficial to restrict the expressive power of
the hypothesis space accordingly. The following two sections illustrate this in
more detail for two particular types of relational domain knowledge: symmetry
and reciprocity.



3.2 Symmetric Relations 

Symmetric relations form an important subclass of relations in our framework. 
As a specific type of symmetric relations, similarity relations constitute the un- 
derlying relation in many application domains where relations between objects 
need to be learned. Symmetric relations are formally defined as follows. 

Definition 3.3. A binary relation $Q : \mathcal{V}^2 \to [0, 1]$ is called a
symmetric relation if for all $(v, v') \in \mathcal{V}^2$ it holds that
$Q(v, v') = Q(v', v)$.

More generally, symmetry can be defined for real-valued relations analogously
as follows.

Definition 3.4. A binary relation $h : \mathcal{V}^2 \to \mathbb{R}$ is called a
symmetric relation if for all $(v, v') \in \mathcal{V}^2$ it holds that
$h(v, v') = h(v', v)$.

For symmetric relations, edges in multi-graphs like Figure 1 become
undirected. Applications arise in many domains and metric learning or learning
similarity measures can be seen as special cases. If the relation is 2-valued
as $Q : \mathcal{V}^2 \to \{0, 1\}$, then we end up with a classification
setting instead of a regression setting.

Symmetry can be easily incorporated in our framework via the following 
modification of the Kronecker kernel: 

$$K^{\Phi}_{\otimes S}(e, \bar{e}) = \frac{1}{2} \Big( K^{\phi}(v, \bar{v}) K^{\phi}(v', \bar{v}') + K^{\phi}(v, \bar{v}') K^{\phi}(v', \bar{v}) \Big) . \tag{8}$$

The symmetric Kronecker kernel has been previously used for predicting
protein-protein interactions in bioinformatics (Ben-Hur and Noble, 2005). The
following theorem shows that the RKHS of the symmetric Kronecker kernel can
approximate arbitrarily well any type of continuous symmetric relation.

Theorem 3.5. (Waegeman et al., 2012) Let

$$S(\mathcal{V}^2) = \{ t \mid t \in C(\mathcal{V}^2), \; t(v, v') = t(v', v) \}$$

be the space of all continuous symmetric relations from $\mathcal{V}^2$ to
$\mathbb{R}$. If $K^{\phi}$ on $\mathcal{V}$ is universal, then the RKHS induced
by the kernel $K^{\Phi}_{\otimes S}$ defined in (8) is dense in
$S(\mathcal{V}^2)$.

In other words the above theorem states that using the symmetric Kronecker 
product kernel is a way to incorporate the prior knowledge about the symmetry 
of the relation to be learned by only sacrificing the unnecessary expressive power. 
Thus, consistency can still be assured, despite considering a smaller hypothesis 
space. 



3.3 Reciprocal Relations 

Let us start with a definition of this type of relation. 

Definition 3.6. A binary relation $Q : \mathcal{V}^2 \to [0, 1]$ is called a
reciprocal relation if for all $(v, v') \in \mathcal{V}^2$ it holds that
$Q(v, v') = 1 - Q(v', v)$.

For general real-valued relations, the notion of antisymmetry can be used in
place of reciprocity:

Definition 3.7. A binary relation $h : \mathcal{V}^2 \to \mathbb{R}$ is called
an antisymmetric relation if for all $(v, v') \in \mathcal{V}^2$ it holds that
$h(v, v') = -h(v', v)$.

For reciprocal and antisymmetric relations, every edge $e = (v, v')$ in a
multi-graph like Figure 1 induces an unobserved invisible edge $e_R = (v', v)$
with appropriate weight in the opposite direction. Applications arise here in
domains such as preference learning, game theory and bioinformatics for
representing preference relations, choice probabilities, winning probabilities,
gene regulation, etc. The weight on the edge defines the real direction of such
an edge. If the weight on the edge $e = (v, v')$ is higher than 0.5, then the
direction is from $v$ to $v'$, but when the weight is lower than 0.5, then the
direction should be interpreted as inverted; for example, the edges from A to C
in Figures 1 (a) and (e) should be interpreted as edges starting from A instead
of C. If the relation is 3-valued as $Q : \mathcal{V}^2 \to \{0, 1/2, 1\}$, then
we end up with a three-class ordinal regression setting instead of an ordinary
regression setting. Analogously to symmetry, reciprocity can also be easily
incorporated in our framework via the following modification of the Kronecker
kernel:

$$K^{\Phi}_{\otimes R}(e, \bar{e}) = \frac{1}{2} \Big( K^{\phi}(v, \bar{v}) K^{\phi}(v', \bar{v}') - K^{\phi}(v, \bar{v}') K^{\phi}(v', \bar{v}) \Big) . \tag{9}$$

Thus, the addition of kernels in the symmetric case becomes a subtraction of 
kernels in the reciprocal case. One can also prove that the RKHS of this so- 
called reciprocal Kronecker kernel allows approximating arbitrarily well any type 
of continuous reciprocal relation. 
Theorem 3.8. (Waegeman et al., 2012) Let

$$R(\mathcal{V}^2) = \{ t \mid t \in C(\mathcal{V}^2), \; t(v, v') = -t(v', v) \}$$

be the space of all continuous antisymmetric relations from $\mathcal{V}^2$ to
$\mathbb{R}$. If $K^{\phi}$ on $\mathcal{V}$ is universal, then the RKHS induced
by the kernel $K^{\Phi}_{\otimes R}$ defined in (9) is dense in
$R(\mathcal{V}^2)$.

Unlike many existing kernel-based methods for relational data, the models 
obtained with the presented kernels are able to represent any symmetric or recip- 
rocal relation respectively, without imposing additional transitivity properties 
of the relations. 
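
The effect of kernels (8) and (9) on the learned model can be illustrated with
a dual model of the form (2): whatever the dual parameters are, predictions
with the symmetric kernel are invariant to swapping the two nodes of an edge,
and predictions with the reciprocal kernel change sign. The sketch below uses
random node features, a Gaussian node kernel and random dual parameters, all of
which are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))      # hypothetical feature vectors of 5 nodes


def k_node(a, b, gamma=0.5):
    """Gaussian node kernel K^phi."""
    return float(np.exp(-gamma * np.sum((X[a] - X[b]) ** 2)))


def k_sym(e, f):
    """Symmetric Kronecker kernel (8)."""
    (v, v2), (w, w2) = e, f
    return 0.5 * (k_node(v, w) * k_node(v2, w2) + k_node(v, w2) * k_node(v2, w))


def k_rec(e, f):
    """Reciprocal Kronecker kernel (9)."""
    (v, v2), (w, w2) = e, f
    return 0.5 * (k_node(v, w) * k_node(v2, w2) - k_node(v, w2) * k_node(v2, w))


train_edges = [(0, 1), (1, 2), (3, 0), (2, 4)]
a = rng.normal(size=len(train_edges))     # arbitrary dual parameters


def predict(kernel, e):
    """Dual model h(e) = sum over training edges of a_f * K(e, f)."""
    return sum(a_f * kernel(e, f) for a_f, f in zip(a, train_edges))


h_sym, h_sym_swapped = predict(k_sym, (4, 3)), predict(k_sym, (3, 4))
h_rec, h_rec_swapped = predict(k_rec, (4, 3)), predict(k_rec, (3, 4))
print(np.isclose(h_sym, h_sym_swapped), np.isclose(h_rec, -h_rec_swapped))
```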



4 Algorithmic Aspects 

This section gives a detailed description of the different algorithms that we 
propose for conditional ranking tasks. Our algorithms are primarily based on 
solving specific systems of linear equations, in which domain knowledge about 
the underlying relations is taken into account. In addition, a detailed discussion 
about the differences between optimizing (5) and (6) is provided.



4.1 Matrix Representation of Symmetric and Reciprocal 
Kernels 

Let us define the so-called commutation matrix, which provides a powerful tool 
for formalizing the kernel matrices corresponding to the symmetric and recip- 
rocal kernels. 

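Definition 4.1 (Commutation matrix). The $s^2 \times s^2$-matrix

$$P^{s^2} = \sum_{i=1}^{s} \sum_{j=1}^{s} e_{(i-1)s+j} \, e_{(j-1)s+i}^{T}$$

is called the commutation matrix (Abadir and Magnus, 2005), where $e_i$ are the
standard basis vectors of $\mathbb{R}^{s^2}$.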

We use the superscript $s^2$ to indicate the dimension $s^2 \times s^2$ of the
matrix $P$, but we omit this notation when the dimensionality is clear from the
context or when the considerations do not depend on the dimensionality. For
$P$, we have the following properties. First, $PP = I$, where $I$ is the
identity matrix, since $P$ is a symmetric permutation matrix. Moreover, for
every square matrix $M \in \mathbb{R}^{s \times s}$, we have
$P \operatorname{vec}(M) = \operatorname{vec}(M^{T})$, where vec is the column
vectorizing operator that stacks the columns of an $s \times s$-matrix in an
$s^2$-dimensional column vector, that is,

$$\operatorname{vec}(M) = (M_{1,1}, M_{2,1}, \ldots, M_{s,1}, M_{1,2}, \ldots, M_{s,s})^{T} . \tag{10}$$

Furthermore, for $M, N \in \mathbb{R}^{s \times t}$, we have

$$P^{s^2} (M \otimes N) = (N \otimes M) P^{t^2} .$$
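
The listed properties of the commutation matrix are easy to verify numerically.
The sketch below builds $P$ with zero-based indices and checks the three
properties on random matrices; the function names and the sizes `s` and `t` are
arbitrary choices for the example.

```python
import numpy as np


def vec(M):
    """Column vectorizing operator (10): stack the columns of M."""
    return M.flatten(order="F")


def commutation_matrix(s):
    """s^2 x s^2 commutation matrix, built entry by entry (zero-based i, j)."""
    P = np.zeros((s * s, s * s))
    for i in range(s):
        for j in range(s):
            P[i * s + j, j * s + i] = 1.0
    return P


rng = np.random.default_rng(0)
s, t = 3, 2
P_s, P_t = commutation_matrix(s), commutation_matrix(t)
M_sq = rng.normal(size=(s, s))
M, N = rng.normal(size=(s, t)), rng.normal(size=(s, t))

involution_ok = np.allclose(P_s @ P_s, np.eye(s * s))           # PP = I
transpose_ok = np.allclose(P_s @ vec(M_sq), vec(M_sq.T))        # P vec(M) = vec(M^T)
swap_ok = np.allclose(P_s @ np.kron(M, N), np.kron(N, M) @ P_t)
print(involution_ok, transpose_ok, swap_ok)
```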

The commutation matrix is used as a building block in constructing the 
following types of matrices: 
Definition 4.2 (Symmetrizer and skew-symmetrizer matrices). The matrices

$$S = \frac{1}{2}(I + P) \quad \text{and} \quad A = \frac{1}{2}(I - P)$$

are known as the symmetrizer and skew-symmetrizer matrix, respectively
(Abadir and Magnus, 2005).

From the properties of the commutation matrix, it is straightforward to
derive the following properties of $S$ and $A$. First, for
$M \in \mathbb{R}^{s \times s}$, we have

$$S \operatorname{vec}(M) = \frac{1}{2} \operatorname{vec}(M + M^{T}) \tag{11}$$

$$A \operatorname{vec}(M) = \frac{1}{2} \operatorname{vec}(M - M^{T}) .$$

Moreover, the matrices $S$ and $A$ are idempotent, that is,

$$SS = S \quad \text{and} \quad AA = A .$$

Furthermore, $S$ and $A$ are orthogonal to each other:

$$SA = AS = 0 . \tag{12}$$

Finally, for $M \in \mathbb{R}^{s \times t}$ the symmetrizer and
skew-symmetrizer matrices commute with the $s^2 \times t^2$-matrix
$M \otimes M$ in the following sense:

$$S^{s^2} (M \otimes M) = (M \otimes M) S^{t^2} \tag{13}$$

$$A^{s^2} (M \otimes M) = (M \otimes M) A^{t^2} \tag{14}$$

where $S^{s^2}$ and $A^{s^2}$ in the left-hand sides are
$s^2 \times s^2$-matrices and $S^{t^2}$ and $A^{t^2}$ are
$t^2 \times t^2$-matrices in the right-hand sides.
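
Properties (11) to (14) can be checked in the same numerical fashion. The
sketch below is self-contained and uses random test matrices; helper names and
sizes are assumptions for illustration.

```python
import numpy as np


def vec(M):
    """Column vectorizing operator: stack the columns of M."""
    return M.flatten(order="F")


def commutation_matrix(s):
    """s^2 x s^2 commutation matrix (zero-based construction)."""
    P = np.zeros((s * s, s * s))
    for i in range(s):
        for j in range(s):
            P[i * s + j, j * s + i] = 1.0
    return P


def symmetrizer(s):
    return 0.5 * (np.eye(s * s) + commutation_matrix(s))


def skew_symmetrizer(s):
    return 0.5 * (np.eye(s * s) - commutation_matrix(s))


rng = np.random.default_rng(0)
s, t = 3, 2
S_s, A_s = symmetrizer(s), skew_symmetrizer(s)
S_t, A_t = symmetrizer(t), skew_symmetrizer(t)
M_sq = rng.normal(size=(s, s))
M = rng.normal(size=(s, t))
MM = np.kron(M, M)                      # s^2 x t^2

checks = {
    "(11) S vec(M)": np.allclose(S_s @ vec(M_sq), 0.5 * vec(M_sq + M_sq.T)),
    "(11) A vec(M)": np.allclose(A_s @ vec(M_sq), 0.5 * vec(M_sq - M_sq.T)),
    "idempotent": np.allclose(S_s @ S_s, S_s) and np.allclose(A_s @ A_s, A_s),
    "(12) SA = AS = 0": np.allclose(S_s @ A_s, 0) and np.allclose(A_s @ S_s, 0),
    "(13)": np.allclose(S_s @ MM, MM @ S_t),
    "(14)": np.allclose(A_s @ MM, MM @ A_t),
}
print(checks)
```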

Armed with the above definitions, we will now consider how the kernel
matrices corresponding to the reciprocal kernel $K^{\Phi}_{\otimes R}$ and the
symmetric kernel $K^{\Phi}_{\otimes S}$ can be represented in matrix notation.
Note that the next proposition also covers the kernel matrices constructed
between, say, nodes encountered in the training set and the nodes encountered
at the prediction phase, and hence the considerations involve two different
sets of nodes.

Proposition 4.3. Let $K \in \mathbb{R}^{r \times p}$ be a kernel matrix
consisting of all kernel evaluations between nodes in sets
$V \subseteq \mathcal{V}$ and $\bar{V} \subseteq \mathcal{V}$, with $|V| = r$
and $|\bar{V}| = p$, that is, $K_{i,j} = K^{\phi}(v_i, \bar{v}_j)$, where
$v_i \in V$ and $\bar{v}_j \in \bar{V}$. The ordinary, symmetric and reciprocal
Kronecker kernel matrices consisting of all kernel evaluations between edges in
$V \times V$ and edges in $\bar{V} \times \bar{V}$ are given by

$$\overline{K} = K \otimes K , \qquad \overline{K}^{S} = S^{r^2} (K \otimes K) , \qquad \overline{K}^{R} = A^{r^2} (K \otimes K) .$$

Proof. The claim concerning the ordinary Kronecker kernel is an immediate
consequence of the definition of the Kronecker product, that is, the entries of
$\overline{K}$ are given as

$$\overline{K}_{(h-1)r+i, \, (j-1)p+k} = K^{\phi}(v_h, \bar{v}_j) \, K^{\phi}(v_i, \bar{v}_k) ,$$

where $1 \leq h, i \leq r$ and $1 \leq j, k \leq p$. To prove the other two
claims, we pay closer attention to the entries of $K \otimes K$. In particular,
the $((j-1)p+k)$-th column of $K \otimes K$ contains all kernel evaluations of
the edges in $V \times V$ with the edge
$(\bar{v}_j, \bar{v}_k) \in \bar{V} \times \bar{V}$. By definition (10) of vec,
the column can be written as $\operatorname{vec}(M)$, where
$M \in \mathbb{R}^{r \times r}$ is a matrix whose $i,h$-th entry contains the
kernel evaluation between the edges $(v_h, v_i) \in V \times V$ and
$(\bar{v}_j, \bar{v}_k) \in \bar{V} \times \bar{V}$, namely
$K^{\phi}(v_h, \bar{v}_j) K^{\phi}(v_i, \bar{v}_k)$. The $((j-1)p+k)$-th column
of $S^{r^2}(K \otimes K)$ can, according to (11), be written as
$\frac{1}{2} \operatorname{vec}(M + M^{T})$, where the $i,h$-th entry of
$\frac{1}{2}(M + M^{T})$ contains the kernel evaluation

$$\frac{1}{2} \Big( K^{\phi}(v_h, \bar{v}_j) K^{\phi}(v_i, \bar{v}_k) + K^{\phi}(v_i, \bar{v}_j) K^{\phi}(v_h, \bar{v}_k) \Big) ,$$

which corresponds to the symmetric Kronecker kernel between the edges
$(v_i, v_h) \in V \times V$ and
$(\bar{v}_j, \bar{v}_k) \in \bar{V} \times \bar{V}$. The reciprocal case is
analogous. □

Note that, due to (13), the above considered symmetric Kronecker kernel
matrix may as well be written as $(K \otimes K)S^{p}$ or as $S^{r}(K \otimes K)S^{p}$. The same
applies to the reciprocal Kronecker kernel matrix due to (14).
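
The properties of the symmetrizer and skew-symmetrizer matrices and the statement of Proposition 4.3 can be checked numerically on a small example. The following is only a minimal sketch: the helper name `symmetrizer_pair`, the random stand-in for the base kernel matrix, and the sizes r = 3 and p = 4 are assumptions made for illustration, not part of the original text.

```python
import numpy as np

def symmetrizer_pair(n):
    """Return (S, A) acting on flattened n x n matrices.

    P is the permutation matrix with P @ vec(M) = vec(M.T); then
    S = (I + P) / 2 and A = (I - P) / 2.
    """
    idx = np.arange(n * n).reshape(n, n)
    P = np.eye(n * n)[idx.T.reshape(-1)]   # row k of P picks the transposed position
    I = np.eye(n * n)
    return 0.5 * (I + P), 0.5 * (I - P)

rng = np.random.default_rng(0)
r, p = 3, 4
K = rng.standard_normal((r, p))   # stand-in for base kernel values between V and V-bar
KK = np.kron(K, K)                # ordinary Kronecker kernel matrix, r^2 x p^2

S_r, A_r = symmetrizer_pair(r)
S_p, A_p = symmetrizer_pair(p)

K_sym = S_r @ KK                  # symmetric Kronecker kernel matrix
K_rec = A_r @ KK                  # reciprocal Kronecker kernel matrix

# Idempotence, orthogonality (12), and the commutation rules (13) and (14).
print(np.allclose(S_r @ S_r, S_r), np.allclose(A_r @ A_r, A_r))
print(np.allclose(S_r @ A_r, 0.0))
print(np.allclose(S_r @ KK, KK @ S_p), np.allclose(A_r @ KK, KK @ A_p))
```

Because the transposition permutation is the same for row-wise and column-wise flattening, the matrices S and A built here do not depend on which vec convention is used.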



4.2 Regression with Symmetric and Reciprocal Kernels 

Let p and q, respectively, represent the number of nodes and edges in T. In
the following, we make the assumption that T contains, for each ordered pair of
nodes (v, v'), exactly one edge starting from v and ending at v', that is, $q = p^2$
and T corresponds to a complete directed graph on p nodes which includes a loop
at each node. As we will show below, this important special case enables the
use of many computational short-cuts for the training phase. This assumption
is dropped in Section 4.4, where we present training algorithms for the more
general case.

Using the notation of Proposition 4.3, we let $K \in \mathbb{R}^{p \times p}$ be the kernel matrix
of $K^{\phi}$, containing similarities between all nodes encountered in T. Due to
the above assumption and Proposition 4.3, the kernel matrices containing the
evaluations of the kernels $K^{\Phi}_{\otimes}$, $K^{\Phi}_{\otimes S}$ and $K^{\Phi}_{\otimes R}$ between the edges in T can be
expressed as $\overline{K}$, $\overline{K}^{S}$ and $\overline{K}^{R}$, respectively.

Recall that, according to the representer theorem, the prediction function
obtained as a solution to problem (1) can be expressed with the dual representation
(2), involving a vector of so-called dual parameters, whose dimension equals
the number of edges in the training set. Here, we represent the dual solution
with a vector $a \in \mathbb{R}^{p^2}$ containing one entry per each possible edge between the
vertices occurring in the training set.

Thus, using standard Tikhonov regularization (Evgeniou et al., 2000), the
objective function of problem (1) with kernel $K^{\Phi}_{\otimes}$ can be rewritten in matrix
notation as

$$\mathcal{L}(y, \overline{K}a) + \lambda a^T \overline{K} a, \tag{15}$$

where $\mathcal{L}: \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}$ is a convex loss function that maps the vector y of
training labels and the vector $\overline{K}a$ of predictions to a real value.

Up to multiplication with a constant, the loss (5) can be represented in
matrix form as

$$(y - \overline{K}a)^T (y - \overline{K}a). \tag{16}$$

Thus, for the regression approach, the objective function to be minimized
becomes:

$$(y - \overline{K}a)^T (y - \overline{K}a) + \lambda a^T \overline{K} a. \tag{17}$$

By taking the derivative of (17) with respect to a, setting it to zero, and solving
with respect to a, we get the following system of linear equations:

$$(\overline{K}\,\overline{K} + \lambda \overline{K})a = \overline{K}y. \tag{18}$$

If the kernel matrix $\overline{K}$ is not strictly positive definite but only positive
semi-definite, $\overline{K}$ should be interpreted as $\lim_{\epsilon \to 0^+}(\overline{K} + \epsilon I)$. Accordingly, (18) can be
simplified to

$$(\overline{K} + \lambda I)a = y. \tag{19}$$

Due to the positive semi-definiteness of the kernel matrix, (19) always has a
unique solution for $\lambda > 0$. Since the solution of (19) is also a solution of (18), it is enough
to concentrate on solving (19).

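Before continuing, we recall certain rules concerning the Kronecker product
(see e.g. Horn and Johnson (1991)) and introduce some notation. Namely, for
$M \in \mathbb{R}^{a \times b}$, $U \in \mathbb{R}^{c \times d}$, $N \in \mathbb{R}^{b \times s}$ and $V \in \mathbb{R}^{d \times t}$, we have:

$$(M \otimes U)(N \otimes V) = (MN) \otimes (UV).$$

From this, it directly follows that

$$(M \otimes N)^{-1} = M^{-1} \otimes N^{-1}. \tag{20}$$

Moreover, for $M \in \mathbb{R}^{a \times b}$, $N \in \mathbb{R}^{b \times c}$, and $U \in \mathbb{R}^{c \times d}$, we have:

$$(U^T \otimes M)\,\mathrm{vec}(N) = \mathrm{vec}(MNU).$$

Furthermore, for $M, N \in \mathbb{R}^{a \times b}$, let $M \odot N$ denote the Hadamard (elementwise)
product, that is, $(M \odot N)_{i,j} = M_{i,j}N_{i,j}$. Further, for a vector $v \in \mathbb{R}^{s}$, let
diag(v) denote the diagonal $s \times s$-matrix, whose diagonal entries are given as
$\mathrm{diag}(v)_{i,i} = v_i$. Finally, for $M, N \in \mathbb{R}^{a \times b}$, we have:

$$\mathrm{vec}(M \odot N) = \mathrm{diag}(\mathrm{vec}(M))\,\mathrm{vec}(N).$$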
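
The rules above are easy to sanity-check numerically. The sketch below assumes that vec stacks the columns of its argument (column-major order), which is the convention under which the vec rule holds in this form; the matrix sizes and variable names are arbitrary choices made for illustration.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(1)

# Mixed-product rule: (M (x) U)(N (x) V) = (MN) (x) (UV).
M = rng.standard_normal((2, 3)); U = rng.standard_normal((4, 5))
N = rng.standard_normal((3, 2)); V = rng.standard_normal((5, 3))
mixed_ok = np.allclose(np.kron(M, U) @ np.kron(N, V), np.kron(M @ N, U @ V))

# Inverse rule (20) for invertible square factors.
P = rng.standard_normal((3, 3)) + 3 * np.eye(3)
Q = rng.standard_normal((2, 2)) + 3 * np.eye(2)
inverse_ok = np.allclose(np.linalg.inv(np.kron(P, Q)),
                         np.kron(np.linalg.inv(P), np.linalg.inv(Q)))

# Vec rule: (U^T (x) M) vec(N) = vec(M N U).
M2 = rng.standard_normal((2, 3))
N2 = rng.standard_normal((3, 4))
U2 = rng.standard_normal((4, 5))
vec_ok = np.allclose(np.kron(U2.T, M2) @ vec(N2), vec(M2 @ N2 @ U2))

# Hadamard rule: vec(M o N) = diag(vec(M)) vec(N).
H1 = rng.standard_normal((3, 4)); H2 = rng.standard_normal((3, 4))
hadamard_ok = np.allclose(vec(H1 * H2), np.diag(vec(H1)) @ vec(H2))

print(mixed_ok, inverse_ok, vec_ok, hadamard_ok)
```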

Using the above notation and rules, we show how to efficiently solve shifted
Kronecker product systems. For a more in-depth analysis of the shifted
Kronecker product systems, we refer to Martin and Van Loan (2006).

Proposition 4.4. Let $M, N \in \mathbb{R}^{p \times p}$ be diagonalizable matrices, that is, the
matrices can be eigen decomposed as

$$M = V\Lambda V^{-1}, \qquad N = U\Sigma U^{-1},$$

where $V, U \in \mathbb{R}^{p \times p}$ contain the eigenvectors and the diagonal matrices $\Lambda, \Sigma \in \mathbb{R}^{p \times p}$
contain the corresponding eigenvalues of M and N. Then, the following
type of shifted Kronecker product system

$$(M \otimes N + \lambda I)a = \mathrm{vec}(Y), \tag{21}$$

where $\lambda > 0$ and $Y \in \mathbb{R}^{p \times p}$, can be solved with respect to a in $O(p^3)$ time if the
inverse of $M \otimes N + \lambda I$ exists.

Proof. By multiplying both sides of Eq. (21) with $(M \otimes N + \lambda I)^{-1}$ from the
left, we get

$$a = (M \otimes N + \lambda I)^{-1}\mathrm{vec}(Y)$$
$$= ((V\Lambda V^{-1}) \otimes (U\Sigma U^{-1}) + \lambda I)^{-1}\mathrm{vec}(Y)$$
$$= ((V \otimes U)(\Lambda \otimes \Sigma)(V^{-1} \otimes U^{-1}) + \lambda I)^{-1}\mathrm{vec}(Y)$$
$$= (V \otimes U)(\Lambda \otimes \Sigma + \lambda I)^{-1}(V^{-1} \otimes U^{-1})\mathrm{vec}(Y) \tag{22}$$
$$= (V \otimes U)(\Lambda \otimes \Sigma + \lambda I)^{-1}\mathrm{vec}(U^{-1}YV^{-T})$$
$$= (V \otimes U)\mathrm{vec}(C \odot E)$$
$$= \mathrm{vec}(U(C \odot E)V^T), \tag{23}$$

where $E = U^{-1}YV^{-T}$ and $\mathrm{diag}(\mathrm{vec}(C)) = (\Lambda \otimes \Sigma + \lambda I)^{-1}$. In line (22), we use
(20) and therefore we can write $\lambda I = \lambda(V \otimes U)(V^{-1} \otimes U^{-1})$, after which we can
add $\lambda$ directly to the eigenvalues $\Lambda \otimes \Sigma$ of $M \otimes N$. The eigen decompositions
of M and N as well as all matrix multiplications in (23) can be computed in
$O(p^3)$ time. □

Corollary 4.5. A minimizer of (17) can be computed in $O(p^3)$ time.

Proof. Since the kernel matrix K is symmetric and positive semi-definite, it is
diagonalizable and it has nonnegative eigenvalues. This ensures that the matrix
$K \otimes K + \lambda I$ has strictly positive eigenvalues and therefore its inverse exists.
Consequently, the claim follows directly from Proposition 4.4, which can be
observed by substituting K for both M and N. □
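
As an illustration of Corollary 4.5, the following minimal sketch solves the shifted Kronecker system for the special case M = N = K with a symmetric positive semi-definite K, so that the eigenvector matrix is orthogonal and its inverse is its transpose. The function name `solve_shifted_kronecker`, the test sizes, and the use of a linear kernel on random features are assumptions for the example only; vec is taken to stack columns.

```python
import numpy as np

def solve_shifted_kronecker(K, Y, lam):
    """Solve (K (x) K + lam * I) a = vec(Y) for symmetric PSD K in O(p^3) time.

    Returns the p x p matrix A_mat with a = vec(A_mat) (column-stacking vec).
    """
    evals, V = np.linalg.eigh(K)                 # K = V diag(evals) V^T
    E = V.T @ Y @ V                              # E = U^{-1} Y V^{-T} with U = V
    C = 1.0 / (np.outer(evals, evals) + lam)     # entries of (Lambda (x) Sigma + lam I)^{-1}
    return V @ (C * E) @ V.T                     # U (C o E) V^T, line (23)

rng = np.random.default_rng(2)
p, lam = 5, 0.1
X = rng.standard_normal((p, 3))
K = X @ X.T                                      # linear base kernel, only positive semi-definite
Y = rng.standard_normal((p, p))

A_fast = solve_shifted_kronecker(K, Y, lam)

# Reference solution that forms the p^2 x p^2 matrix explicitly.
a_dense = np.linalg.solve(np.kron(K, K) + lam * np.eye(p * p),
                          Y.reshape(-1, order="F"))
print(np.allclose(A_fast.reshape(-1, order="F"), a_dense))
```

The fast routine never forms the $p^2 \times p^2$ matrix; its cost is dominated by one eigen decomposition and a handful of $p \times p$ matrix products.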

We continue by considering the use of the symmetric and reciprocal Kro- 
necker kernels and show that, with those, the dual solution can be obtained as 
easily as with the ordinary Kronecker kernel. We first present and prove the 
following two inversion identities: 

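Lemma 4.6. Let $\overline{N} = N \otimes N$ for some square matrix N. Then,

$$(S\overline{N}S + \lambda I)^{-1} = S(\overline{N} + \lambda I)^{-1}S + \frac{1}{\lambda}A, \tag{24}$$

$$(A\overline{N}A + \lambda I)^{-1} = A(\overline{N} + \lambda I)^{-1}A + \frac{1}{\lambda}S, \tag{25}$$

if the considered inverses exist.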

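Proof. We prove (24) by multiplying $S\overline{N}S + \lambda I$ with its alleged inverse matrix
and show that the result is the identity matrix:

$$(S\overline{N}S + \lambda I)\left(S(\overline{N} + \lambda I)^{-1}S + \frac{1}{\lambda}A\right)$$
$$= S\overline{N}SS(\overline{N} + \lambda I)^{-1}S + \frac{1}{\lambda}S\overline{N}SA + \lambda S(\overline{N} + \lambda I)^{-1}S + \lambda\frac{1}{\lambda}A \tag{26}$$
$$= S\overline{N}(\overline{N} + \lambda I)^{-1} + \lambda S(\overline{N} + \lambda I)^{-1} + A \tag{27}$$
$$= S(I - \lambda(\overline{N} + \lambda I)^{-1}) + \lambda S(\overline{N} + \lambda I)^{-1} + A \tag{28}$$
$$= S - \lambda S(\overline{N} + \lambda I)^{-1} + \lambda S(\overline{N} + \lambda I)^{-1} + A$$
$$= I.$$

When going from (26) to (27), we use the fact that S commutes with $(\overline{N} + \lambda I)^{-1}$,
because it commutes with both $\overline{N}$ and I. Moreover, the second term of (26)
vanishes, because of the orthogonality of S and A to each other. In (28) we
have used the following inversion identity known in matrix calculus literature
(Henderson and Searle, 1981):

$$\overline{N}(\overline{N} + \lambda I)^{-1} = I - \lambda(\overline{N} + \lambda I)^{-1}.$$

The last equality follows from $S + A = I$. Identity (25) can be proved analogously. □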

These inversion identities indicate that we can invert a diagonally shifted
symmetric or reciprocal Kronecker kernel matrix simply by modifying the
inverse of a diagonally shifted ordinary Kronecker kernel matrix. This is an
advantageous property, since the computational short-cuts provided by
Proposition 4.4 ensure the fast inversion of the shifted ordinary Kronecker kernel
matrices, and its results can thus be used to accelerate the computations for the
symmetric and reciprocal cases too.

The next result uses the above inversion identities to show that, when
learning symmetric or reciprocal relations with kernel ridge regression
(Saunders et al., 1998; Suykens et al., 2002; Shawe-Taylor and Cristianini, 2004;
Pahikkala et al., 2009), we do not explicitly have to use the symmetric and
reciprocal Kronecker kernels. Instead, we can just use the ordinary Kronecker
kernel to learn the desired model as long as we ensure that the symmetry or
reciprocity is encoded in the labels.

Proposition 4.7. Using the symmetric Kronecker kernel for RLS regression 
with a label vector y is equivalent to using an ordinary Kronecker kernel and a 
label vector Sy. One can observe an analogous relationship between the reciprocal 
Kronecker kernel and a label vector Ay. 

Proof. Let

$$a = (S\overline{K}S + \lambda I)^{-1}y$$
$$b = (\overline{K} + \lambda I)^{-1}Sy$$

be solutions of (17) with the symmetric Kronecker kernel and label vector y
and with the ordinary Kronecker kernel and label vector Sy, respectively. Using
identity (24), we get

$$a = \left(S(\overline{K} + \lambda I)^{-1}S + \frac{1}{\lambda}A\right)y = (\overline{K} + \lambda I)^{-1}Sy + \frac{1}{\lambda}Ay.$$

In the last equality, we again used the fact that S commutes with $(\overline{K} + \lambda I)^{-1}$,
because it commutes with both $\overline{K}$ and I. Let (v, v') be a new couple of nodes
for which we are supposed to do a prediction with a regressor determined by the
coefficients a. Moreover, let $k_v, k_{v'} \in \mathbb{R}^{p}$ denote, respectively, the base kernel
$K^{\phi}$ evaluations of the nodes v and v' with the nodes in the training data. Then,
$k_v \otimes k_{v'} \in \mathbb{R}^{q}$ contains the Kronecker kernel $K^{\Phi}_{\otimes}$ evaluations of the edge (v, v')
with all edges in the training data. Further, according to Proposition 4.3, the
corresponding vector of symmetric Kronecker kernel evaluations is $S(k_v \otimes k_{v'})$.
Now, the prediction for the couple (v, v') can be expressed as

$$(k_v \otimes k_{v'})^T Sa = (k_v \otimes k_{v'})^T S\left((\overline{K} + \lambda I)^{-1}Sy + \frac{1}{\lambda}Ay\right)$$
$$= (k_v \otimes k_{v'})^T S(\overline{K} + \lambda I)^{-1}Sy + \frac{1}{\lambda}(k_v \otimes k_{v'})^T SAy \tag{29}$$
$$= (k_v \otimes k_{v'})^T S(\overline{K} + \lambda I)^{-1}Sy$$
$$= (k_v \otimes k_{v'})^T Sb,$$

where the second term of (29) vanishes due to (12). The analogous result for the reciprocal
Kronecker kernel can be shown in a similar way. □

As a consequence of this, we also have a computationally efficient method 
for RLS regression with symmetric and reciprocal Kronecker kernels. Encoding 
the properties into the label matrix ensures that the corresponding variations 
of the Kronecker kernels are implicitly used. 
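
The equivalence stated in Proposition 4.7, together with the inversion identity (24) it relies on, can be illustrated on a toy problem. This is a sketch under stated assumptions: the helper `symmetrizer_pair`, the random three-dimensional node features, the linear base kernel, and the explicit formation of the $p^2 \times p^2$ matrices are all chosen only to keep the check short and are not how one would train in practice.

```python
import numpy as np

def symmetrizer_pair(n):
    """Return (S, A) with S vec(M) = vec(M + M^T)/2 and A vec(M) = vec(M - M^T)/2."""
    idx = np.arange(n * n).reshape(n, n)
    P = np.eye(n * n)[idx.T.reshape(-1)]
    I = np.eye(n * n)
    return 0.5 * (I + P), 0.5 * (I - P)

rng = np.random.default_rng(3)
p, lam = 4, 0.5
X = rng.standard_normal((p, 3))          # hypothetical node features
K = X @ X.T                              # linear base kernel on the training nodes
Kbar = np.kron(K, K)                     # ordinary Kronecker kernel matrix
S, A = symmetrizer_pair(p)
I = np.eye(p * p)
y = rng.standard_normal(p * p)           # one label per ordered pair of nodes

# Inversion identity (24).
lhs = np.linalg.inv(S @ Kbar @ S + lam * I)
rhs = S @ np.linalg.inv(Kbar + lam * I) @ S + A / lam

# Dual solutions: symmetric kernel with labels y, ordinary kernel with labels Sy.
a = np.linalg.solve(S @ Kbar @ S + lam * I, y)
b = np.linalg.solve(Kbar + lam * I, S @ y)

# Prediction for a new couple of nodes (v, v').
x_v, x_w = rng.standard_normal(3), rng.standard_normal(3)
k_edge = np.kron(X @ x_v, X @ x_w)       # Kronecker kernel values against all training edges
pred_sym, pred_ord = k_edge @ S @ a, k_edge @ S @ b

# Reciprocal counterpart: labels Ay with the ordinary kernel.
a_r = np.linalg.solve(A @ Kbar @ A + lam * I, y)
b_r = np.linalg.solve(Kbar + lam * I, A @ y)
pred_rec, pred_rec_ord = k_edge @ A @ a_r, k_edge @ A @ b_r

print(np.allclose(lhs, rhs), np.isclose(pred_sym, pred_ord), np.isclose(pred_rec, pred_rec_ord))
```

Note that a and b themselves differ by the term $\frac{1}{\lambda}Ay$; only the predictions coincide.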



4.3 Conditional Ranking with Symmetric and Reciprocal 
Kernels 

Now, we show how loss function (6) can be represented in matrix form. This
representation is similar to the RankRLS loss introduced by Pahikkala et al.
(2009). Let

$$C^{l} = I - \frac{1}{l}\mathbf{1}^{l}\mathbf{1}^{lT}, \tag{30}$$

where $l \in \mathbb{N}$, I is the $l \times l$-identity matrix, and $\mathbf{1}^{l} \in \mathbb{R}^{l}$ is the vector of which
every entry is equal to 1, be the $l \times l$-centering matrix. The matrix $C^{l}$ is an
idempotent matrix and multiplying it with a vector subtracts the mean of the
vector entries from all elements of the vector. Moreover, the following equality
can be shown

$$\frac{1}{2l^2}\sum_{i,j=1}^{l}(c_i - c_j)^2 = \frac{1}{l}c^T C^{l} c,$$

where $c_i$ are the entries of a vector c. Now, let us consider the following
quasi-diagonal matrix:

$$L = \begin{pmatrix} C^{l_1} & & \\ & \ddots & \\ & & C^{l_p} \end{pmatrix}, \tag{31}$$

where $l_i$ is the number of edges starting from $v_i$ for $i \in \{1, \ldots, p\}$. Again, given
the assumption that the training data contains all possible edges between the
nodes exactly once and hence $l_i = p$ for all $1 \le i \le p$, loss function (6) can be,
up to multiplication with a constant, represented in matrix form as

$$(y - \overline{K}a)^T L (y - \overline{K}a), \tag{32}$$

provided that the entries of $y - \overline{K}a$ are ordered in a way compatible with the
entries of L, that is, the training edges are arranged according to their starting
nodes.

Analogously to the regression case, the training phase corresponds to solving
the following system of linear equations:

$$(\overline{K}^T L \overline{K} + \lambda \overline{K})a = \overline{K}^T L y. \tag{33}$$
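
The role of the centering matrix and of the block structure of L can be made concrete with a short numerical check. The sketch below assumes the complete-graph case $l_i = p$, in which L equals $I \otimes C^{p}$ when the edges are ordered by their starting node; the function name `centering_matrix` and the random residual vector are illustrative assumptions.

```python
import numpy as np

def centering_matrix(l):
    """Return the l x l centering matrix C = I - (1/l) * 1 1^T from (30)."""
    return np.eye(l) - np.ones((l, l)) / l

rng = np.random.default_rng(4)
p = 4
C = centering_matrix(p)

# Pairwise squared differences versus the quadratic form with C.
c = rng.standard_normal(p)
pairwise = sum((c[i] - c[j]) ** 2 for i in range(p) for j in range(p)) / (2 * p ** 2)
quadratic = c @ C @ c / p

# Quasi-diagonal matrix (31) when every node has exactly p outgoing edges.
L = np.kron(np.eye(p), C)

# A stand-in residual vector, with entry h * p + i belonging to edge (v_h, v_i).
residual = rng.standard_normal(p * p)
loss_matrix_form = residual @ L @ residual

# The same quantity computed block by block: center each start node's residuals.
blocks = residual.reshape(p, p)
loss_blockwise = ((blocks - blocks.mean(axis=1, keepdims=True)) ** 2).sum()

print(np.isclose(pairwise, quadratic), np.isclose(loss_matrix_form, loss_blockwise))
```

The block-wise computation shows why only differences between predictions sharing a starting node matter: adding a constant to all residuals of one block leaves the loss unchanged.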

If the ordinary Kronecker kernel is used, we get a result analogous to
Corollary 4.5.

Corollary 4.8. A solution of (33) can be computed in $O(p^3)$ time.

Proof. Given that $l_i = p$ for all $1 \le i \le p$ and that the ordinary Kronecker
kernel is used, matrix (31) can be written as $(I \otimes C^{p})$ and the system of linear
equations (33) becomes:

$$((K \otimes K)(I \otimes C^{p})(K \otimes K) + \lambda K \otimes K)a = (K \otimes K)(I \otimes C^{p})y.$$

While the kernel matrix $K \otimes K$ is not necessarily invertible, a solution can still
be obtained from the following reduced form:

$$((K \otimes K)(I \otimes C^{p}) + \lambda I)a = (I \otimes C^{p})y.$$

This can, in turn, be rewritten as

$$(K \otimes KC^{p} + \lambda I)a = (I \otimes C^{p})y. \tag{34}$$

The matrix $C^{p}$ is symmetric, and hence if K is strictly positive definite,
the product $KC^{p}$ is diagonalizable and has nonnegative eigenvalues (see e.g.
Horn and Johnson (1985, p. 465)). Therefore, (34) is of the form which can be
solved in $O(p^3)$ time due to Proposition 4.4. The situation is more involved if
K is positive semi-definite. In this case, we can solve the so-called primal form
with an empirical kernel map (see e.g. Airola et al. (2011a)) instead of (34) and
again end up with a Kronecker system solvable in $O(p^3)$ time. We omit the
details of this consideration due to its lengthiness and technicality. □



4.4 Conjugate Gradient-Based Training Algorithms 

Interestingly, if we use the symmetric or reciprocal Kronecker kernel for
conditional ranking, we do not have a similar efficient closed-form solution as those
indicated by Corollaries 4.5 and 4.8. The same concerns both regression and
ranking if the above assumption of the training data having every possible edge
between all nodes encountered in the training data (i.e. $l_i = p$ for all $1 \le i \le p$)
is dropped. Fortunately, we can still design algorithms that take advantage of
the special structure of the kernel matrices and the loss function in speeding
up the training process, although they are not as efficient as the above-described
closed-form solutions.

Before proceeding, we introduce some extra notation. Let $B \in \{0, 1\}^{q \times p^2}$
be a bookkeeping matrix of the training data, that is, its rows and columns are
indexed by the edges in the training data and the set of all possible pairs of
nodes, respectively. Each row of B contains a single nonzero entry indicating to
which pair of nodes the edge corresponds. This matrix covers both the situation
in which some of the possible edges are not in the training data and the one in
which there are several edges adjacent to the same nodes. Objective function
(15) can be written as

$$\mathcal{L}(y, B\overline{K}a) + \lambda a^T \overline{K} a$$

with the ordinary Kronecker kernel and analogously with the symmetric and
reciprocal kernels. Note that the number of dual variables stored in vector a is
still $p^2$, while the number of labels in y is q. If an edge is not in the training
data, the corresponding entry in a is zero, and if a particular edge occurs several
times, the corresponding entry is the sum of the corresponding variables $a_e$ in
representation (2). For the ranking loss, the system of linear equations to be
solved becomes

$$(\overline{K}B^T L B\overline{K} + \lambda \overline{K})a = \overline{K}B^T L y.$$

If we use an identity matrix instead of L, the system corresponds to the
regression loss.

To solve the above type of linear systems, we consider an approach based
on conjugate gradient type of methods with early stopping regularization. The
Kronecker product $(K \otimes K)v$ can be written as vec(KVK), where $v = \mathrm{vec}(V) \in \mathbb{R}^{p^2}$
and $V \in \mathbb{R}^{p \times p}$. Computing this product is cubic in the number of nodes.
Moreover, multiplying a vector with the matrices S or A does not increase the
computational complexity, because they contain only $O(p^2)$ nonzero elements.
Similarly, the matrix B has only q non-zero elements. Finally, we observe from
(30) and (31) that the matrix L can be written as $L = I - QQ^T$, where
$Q \in \mathbb{R}^{q \times p}$ is the following quasi-diagonal matrix:

$$Q = \begin{pmatrix} \frac{1}{\sqrt{l_1}}\mathbf{1}^{l_1} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{l_p}}\mathbf{1}^{l_p} \end{pmatrix}. \tag{35}$$

The matrices I and Q both have O(q) nonzero entries, and hence multiplying a
vector with the matrix L can also be performed in O(q) time.

Conjugate gradient methods require, in the worst case, 0(j> ) iterations in 
order to solve the system of linear equations (33) under consideration. However,
the number of iterations required in practice is a small constant, as we will show
in the experiments. In addition, since using early stopping with gradient-based
methods has a regularizing effect on the learning process (see e.g. Engl et al.
(1996)), this approach can be used instead of or together with Tikhonov
regularization.
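
The matrix-free products described above can be sketched as follows. This is an illustrative implementation under several assumptions: a plain textbook conjugate gradient loop is used (the experiments in this article instead report a biconjugate gradient stabilized solver), K is taken to be symmetric and strictly positive definite, edges are given as hypothetical index arrays `start` and `end`, and vector entry h * p + i corresponds to the pair $(v_h, v_i)$. All function names are invented for the example.

```python
import numpy as np

def make_operator(K, start, end, lam):
    """Return v -> (Kbar B^T L B Kbar + lam * Kbar) v and y -> Kbar B^T L y."""
    p = K.shape[0]
    flat = start * p + end                        # column of B holding each edge's 1
    counts = np.bincount(start, minlength=p)      # l_i: edges starting from node i

    def kron_mv(v):                               # (K (x) K) v = vec(K V K), O(p^3)
        return (K @ v.reshape(p, p) @ K).reshape(-1)

    def center(w):                                # L w = w - Q Q^T w, O(q)
        means = np.bincount(start, weights=w, minlength=p) / np.maximum(counts, 1)
        return w - means[start]

    def scatter(w):                               # B^T w, sums duplicate edges, O(q)
        return np.bincount(flat, weights=w, minlength=p * p)

    def matvec(v):
        u = kron_mv(v)
        return kron_mv(scatter(center(u[flat]))) + lam * u

    def rhs(y):
        return kron_mv(scatter(center(y)))

    return matvec, rhs

def conjugate_gradient(matvec, b, iters=200, tol=1e-10):
    """Textbook conjugate gradient; a small `iters` acts as early stopping."""
    x = np.zeros_like(b)
    r = b.copy()
    d = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(iters):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

rng = np.random.default_rng(5)
p, q, lam = 5, 15, 1.0
X = rng.standard_normal((p, 3))
K = X @ X.T / 3 + np.eye(p)                       # strictly positive definite base kernel
start = rng.integers(0, p, size=q)                # start node of each training edge
end = rng.integers(0, p, size=q)                  # end node of each training edge
y = rng.standard_normal(q)

matvec, rhs = make_operator(K, start, end, lam)
a = conjugate_gradient(matvec, rhs(y))
```

Replacing `center` by the identity map gives the corresponding operator for the regression loss.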



4.5 Theoretical Considerations 

Next, we give theoretical insights to back the idea of using RankRLS-based
learning methods instead of ordinary RLS regression. As observed earlier,
the main difference between RankRLS and the ordinary RLS is that RankRLS
enforces the learned models to be block-wise centered, that is, the aim is to
learn models that, for each node v, correctly predict the differences between
the utility values of the edges (v, v') and (v, v''), rather than the utility values
themselves. This is common for most of the pairwise learning to rank algorithms,
since learning the individual utility values is, in ranking tasks, relevant only
in relation to other utility values. Below, we consider whether the block-wise
centering approach actually helps in achieving this aim. This is done via
analyzing the regression performance of the utility value differences.

We start by considering the matrix forms of the objective functions of the
ordinary RLS regression

$$J(a) = (y - \overline{K}a)^T (y - \overline{K}a) + \lambda a^T \overline{K} a \tag{36}$$

and RankRLS for conditional ranking

$$F(a) = (y - \overline{K}a)^T L (y - \overline{K}a) + \lambda a^T \overline{K} a, \tag{37}$$

where $L \in \mathbb{R}^{q \times q}$ is a quasi-diagonal matrix whose diagonal blocks are $p \times p$-centering
matrices. Here we make the further assumption that the label vector
is block-wise centered, that is, y = Ly. We are always free to make this
assumption with conditional ranking tasks.

The following lemma indicates that we can consider the RankRLS problem
as an ordinary RLS regression problem with a modified kernel.

Lemma 4.9. Objective functions (37) and

$$W(a) = (y - L\overline{K}La)^T (y - L\overline{K}La) + \lambda a^T L\overline{K}L a \tag{38}$$

have a common minimizer.

Proof. By repeatedly applying the idempotence of L and the inversion identities
of Henderson and Searle (1981), one of the solutions of (37) can be written as

$$a = (L\overline{K} + \lambda I)^{-1}Ly$$
$$= (LL\overline{K} + \lambda I)^{-1}Ly$$
$$= L(L\overline{K}L + \lambda I)^{-1}y$$
$$= L(L\overline{K}LL + \lambda I)^{-1}y$$
$$= (LL\overline{K}L + \lambda I)^{-1}Ly$$
$$= (L\overline{K}L + \lambda I)^{-1}y,$$

which is also a minimizer of (38). □
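
The chain of equalities in the proof can be verified numerically for block-wise centered labels. The sketch below uses a small complete graph, a linear base kernel on random features and explicitly formed matrices; these choices, and all variable names, are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
p, lam = 4, 0.3
X = rng.standard_normal((p, 3))
K = X @ X.T                                   # base kernel matrix on the nodes
Kbar = np.kron(K, K)                          # ordinary Kronecker kernel matrix
C = np.eye(p) - np.ones((p, p)) / p           # p x p centering matrix
L = np.kron(np.eye(p), C)                     # quasi-diagonal matrix of centering blocks
I = np.eye(p * p)

y = L @ rng.standard_normal(p * p)            # block-wise centered labels, y = L y

a_rank = np.linalg.solve(L @ Kbar + lam * I, L @ y)       # first line of the proof
a_mod = np.linalg.solve(L @ Kbar @ L + lam * I, y)        # last line of the proof

# Gradients (up to a factor 2) of F in (37) and W in (38) at the common solution.
grad_F = Kbar @ (L @ (Kbar @ a_rank - y) + lam * a_rank)
LKL = L @ Kbar @ L
grad_W = LKL @ (LKL @ a_mod - y + lam * a_mod)

print(np.allclose(a_rank, a_mod), np.allclose(grad_F, 0), np.allclose(grad_W, 0))
```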



This lemma provides us with a different perspective on RankRLS. Namely, if we
have prior knowledge that the underlying regression function to be learned is
block-wise centered (i.e. we have a conditional ranking task), this knowledge is
simply encoded into a kernel function, just like we do with the knowledge about
the reciprocity and symmetry.

In the literature, there are many results (see e.g. De Vito et al. (2005) and
references therein) indicating that the expected prediction error of the
regularized least-squares based kernel regression methods obeys the following type
of probabilistic upper bounds. For simplicity, we only consider the regression
error. Namely, for any $0 < \eta < 1$, it holds that

$$P\left[ I[f_{\lambda,T}] - \inf_{f \in \mathcal{H}_K} I[f] \le B(\lambda, K, \eta) \right] \ge 1 - \eta, \tag{39}$$

where $P[\cdot]$ denotes the probability, $I[\cdot]$ is the expected prediction error, $f_{\lambda,T}$ is
the prediction function obtained via regularized risk minimization on a training
set T and a regularization parameter $\lambda$, $\mathcal{H}_K$ is the RKHS associated to the
kernel K, and $B(\lambda, K, \eta)$ is a complexity term depending on the kernel, the
amount of regularization, and the confidence level $\eta$.

According to Lemma 4.9, if the underlying regression function y is block-wise
centered, which is the case in the conditional ranking tasks, we can consider
learning with conditional RankRLS as performing regression with a block-wise
centered kernel, and hence the behavior of RLS regression and RankRLS can be
compared with each other under the framework given in (39). When comparing
the two kernels, we first have to pay attention to the corresponding RKHS
constructions $\mathcal{H}_K$. The RKHS of the original kernel is more expressive than
that of the block-wise centered kernel, because the former is able to express
functions that are not block-wise centered while the latter cannot. However,
since we consider conditional ranking tasks, this extra expressiveness is of no
help and the terms $\inf_{f \in \mathcal{H}_K} I[f]$ are equal for the two kernels.

Next, we focus our attention on the complexity term. A typical example of
the term is the one proposed by De Vito et al. (2005), which is proportional to
$\kappa = \sup_e K(e, e)$. Now, the quantity $\kappa$ is lower for the block-wise centered kernel
than for the original one, and hence the former has tighter error bounds than
the latter. This, in turn, indicates that RankRLS is indeed a more promising
approach for learning to predict the utility value differences than the ordinary
RLS. It would be interesting to extend the analysis from the regression error to
the pairwise ranking error itself but the analysis is far more challenging and it
is considered as an open problem by De Vito et al. (2005).



5 Links with Existing Ranking Methods 

Examining the pairwise loss (4) reveals that there exists a quite straightforward
mapping from the task of conditional ranking to that of traditional ranking.
Relation graph edges are in this mapping explicitly used for training and prediction.
In recent years, several algorithms for learning to rank have been proposed,
which can be used for conditional ranking, by interpreting the conditioning
node as a query (see e.g. Joachims (2002); Freund et al. (2003); Pahikkala et al.
(2009)). The main application has been in information retrieval, where the
examples are joint feature representations of queries and documents, and
preferences are induced only between documents connected to the same query. One
of the earliest and most successful of these methods is the ranking support
vector machine RankSVM (Joachims, 2002), which optimizes the pairwise hinge
loss. Even more closely related is the ranking regularized least-squares
method RankRLS (Pahikkala et al., 2009), previously proposed by some of the
present authors. The method is based on minimizing the pairwise regularized
squared loss and becomes equivalent to the algorithms proposed in this article,
if it is trained directly on the relation graph edges.

What this means in practice is that when the training relation graph is
sparse enough, say consisting of only a few thousand edges, existing methods for
learning to rank can be used to train conditional ranking models. In fact this is
how we perform the rock-paper-scissors experiments, as discussed in Section 6.1.
However, if the training graph is dense, existing methods for learning to rank
are of quite limited use.

Let us assume a training graph that has p nodes. Furthermore, we assume
that most pairs of nodes in the graph are connected by an edge, meaning that the number
of edges is of the order $p^2$. Using a learning algorithm that explicitly calculates
the kernel matrix for the edges would thus need to construct and store a $p^2 \times p^2$
matrix, which is intractable already when p is less than a thousand. When
the standard Kronecker kernel is used together with a linear kernel for the
nodes, primal training algorithms (see e.g. Joachims (2006)) could be used
without forming the kernel matrix. Assuming on average d non-zero features
per node, this would result in having to form a data matrix with $p^2 d^2$ non-zero
entries. Again, this would be both memory-wise and computationally infeasible
for relatively modest values of p and d.

Thus, building practical algorithms for solving the conditional ranking task
requires computational shortcuts to avoid the above-mentioned space and time
complexities. The methods presented in this article are based on such shortcuts,
because queries and objects come from the same domain, resulting in a special
structure of the Kronecker product kernel and a closed-form solution for the
minimizer of the pairwise regularized squared loss.



6 Experiments 

In the experiments we consider conditional ranking tasks on synthetic and real- 
world data in various application domains, illustrating different aspects of the 
generality of our approach. The first experiment considers a potential applica- 
tion in game playing, using the synthetic rock-paper-scissors data set, in which 
the underlying relation is both reciprocal and intransitive. The task is to learn 
a model for ranking players according to their likelihood of winning against any 
other player on whom the ranking is conditioned. The second experiment con- 
siders a potential application in information retrieval, using the 20-newsgroups 
data set. Here the task consists of ranking documents according to their simi- 
larity to any other document, on which the ranking is conditioned. The third 
experiment summarizes a potential application of identifying bacterial species 
in microbiology. The goal consists of retrieving a bipartite ranking for a given 
species, in which bacteria from the same species have to be ranked before bacte- 
ria from a different species. On both the newsgroups and bacterial data we test 
the capability of the models to generalize to such newsgroups or species that 
have not been observed during training. 

In all the experiments, we run both the conditional ranker that minimizes
the convex edgewise ranking loss approximation (6) and the method that
minimizes the regression loss (5) over the edges. Furthermore, in the
rock-paper-scissors experiment we also train a conditional ranker with RankSVM. For the
20-newsgroups and bacterial species data this is not possible due to the large
number of edges present in the relational graph, resulting in too high memory
requirements and computational costs for RankSVM training to be practical.
We use the Kronecker kernel for edges in all the experiments. We also
test the effects of enforcing domain knowledge by applying the reciprocal kernel
$K^{\Phi}_{\otimes R}$ in the rock-paper-scissors experiment, and applying the symmetric kernel $K^{\Phi}_{\otimes S}$ in the
20-newsgroups and bacterial data experiments. The linear kernel is used for
individual nodes (thus, for $K^{\phi}$). In all the experiments, performance is measured
using the ranking loss (4) on the test set.

We use a variety of approaches for minimizing the squared conditional
ranking and regression losses, depending on the characteristics of the task. All
the solvers based on optimizing the standard, or pairwise regularized
least-squares loss are from the RLScore software package¹. For the experiment where
the training is performed iteratively, we apply the biconjugate gradient
stabilized method (BGSM) (van der Vorst, 1992). The RankSVM based conditional
ranker baseline considered in the rock-paper-scissors experiment is trained with
the TreeRankSVM software (Airola et al., 2011b).



6.1 Game Playing: the Rock-Paper-Scissors Dataset 

The synthetic benchmark data, whose generation process is described in detail
by Pahikkala et al. (2010b), consists of simulated games of the well-known game
of rock-paper-scissors between pairs of players. The training set contains the
outcomes of 1000 games played between 100 players; the outcomes are labeled
according to which of the players won. The test set consists of another group
of 100 players, and for each pair of players the probability of the first player
winning against the second one. Different players differ in how often they play
each of the three possible moves in the game. The data set can be considered
as a directed graph where players are nodes and played games are edges. The true
underlying relation generating the data is in this case reciprocal. Moreover, the
relation is intransitive. It represents the probability that one player wins against
another player. Thus, it is not meaningful to try to construct a global ranking
of the players. In contrast, conditional ranking is a sensible task, where players
are ranked according to their estimated probability of winning against a given
player.

We experiment with three different variations of the data set, the w1, w10
and w100 sets. These data sets differ in how balanced the strategies played by
the players are. In w1 all the players have close to equal probability of playing
any of the three available moves, while in w100 each of the players has a favorite
strategy he/she will use much more often than the other strategies. Both the
training and test sets in the three cases are generated one hundred times and
the hundred ranking results are averaged for each of the three cases and for
every tested learning method.

Since the training set consists of only one thousand games, it is feasible to 
adapt existing ranking algorithm implementations for solving the conditional 
ranking task. Each game is represented as two edges, labeled as +1 if the 
edge starts from the winner, and as —1 if the edge starts from the loser. Each 
node has only 3 features, and thus, the explicit feature representation where 
the Kronecker kernel is used together with a linear kernel results in 9 product 
features for each edge. In addition, we generate an analogous feature repre- 
sentation for the reciprocal Kronecker kernel. We use these generated feature 
representations for the edges to train three algorithms. RLS regresses directly 
the edge scores, RankRLS minimizes pairwise regularized squared loss on the 
edges, and RankSVM minimizes pairwise hinge loss on the edges. For RankRLS 
and RankSVM, pairwise preferences are generated only between edges starting 
from the same node. 
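
One way such explicit edge features could be generated is sketched below. The text does not spell out the construction, so the functions `kronecker_features` and `reciprocal_features`, as well as the example move-frequency vectors, are assumptions: the first yields the 9 product features of the Kronecker kernel with a linear node kernel, and the second is an antisymmetrized version whose inner product reproduces the reciprocal Kronecker kernel values given in Proposition 4.3.

```python
import numpy as np

def kronecker_features(x_from, x_to):
    """Explicit features of edge (v, v') for the Kronecker kernel with a linear node kernel."""
    return np.kron(x_from, x_to)

def reciprocal_features(x_from, x_to):
    """Antisymmetrized features; swapping the two nodes flips the sign."""
    return 0.5 * (np.kron(x_from, x_to) - np.kron(x_to, x_from))

# Hypothetical players described by their frequencies of rock, paper and scissors.
x = np.array([0.5, 0.3, 0.2])
x2 = np.array([0.2, 0.2, 0.6])
z = np.array([0.1, 0.8, 0.1])
z2 = np.array([0.4, 0.4, 0.2])

phi = kronecker_features(x, x2)               # 9 product features
phi_r = reciprocal_features(x, x2)

# Inner products of explicit features equal the corresponding kernel values.
k_ord = (x @ z) * (x2 @ z2)
k_rec = 0.5 * ((x @ z) * (x2 @ z2) - (x @ z2) * (x2 @ z))
print(phi.shape,
      np.isclose(phi @ kronecker_features(z, z2), k_ord),
      np.isclose(phi_r @ reciprocal_features(z, z2), k_rec))
```

With the reciprocal features, a linear model automatically predicts opposite scores for the edges (v, v') and (v', v), which matches the +1/-1 labeling of the two edges created for each game.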

In initial preliminary experiments we noticed that on this data set
regularization seemed to be harmful, with methods typically reaching optimal performance
for close to zero regularization parameter values. Further, cross-validation as
a parameter selection strategy appeared to work very poorly, due to the small
training set size and the large amount of noise present in the training data.
Thus, we performed the runs using a fixed regularization parameter set to a
close to zero value ($2^{-30}$).

1 Available at http://staff.cs.utu.fi/~aatapa/software/RLScore/

            c.reg      c.reg (r)  c.rank     c.rank (r)  RankSVM    RankSVM (r)
w = 1       0.4875     0.4868     0.4876     0.4880      0.4987     0.4891
w = 10      0.04172    0.04145    0.04519    0.04291     0.04535    0.04116
w = 100     0.001380   0.001366   0.001424   0.001354    0.006997   0.005824

Table 1: Overview of the measured rank loss for rock-paper-scissors. The
abbreviations c.reg and c.rank here refer to the RLS and RankRLS algorithm,
respectively, and (r) refers to the use of a reciprocal Kronecker kernel instead
of the ordinary Kronecker kernel.

The results of the experiments for the fixed regularization parameter value
are presented in Table 1. Clearly, the methods are successful in learning
conditional ranking models, and the easier the problem is made, the better the
performance is. For all the methods and data sets, except for the conditional
ranking method on the w = 1 data, the pairwise ranking error is smaller when
using the reciprocal kernel. Thus, enforcing prior knowledge about the
properties of the true underlying relation appears to be beneficial. On this
data set, standard regression proves to be competitive with the pairwise
ranking approaches. Similar results, where regression approaches can yield an
equally good, or even a lower ranking error than rank loss optimizers, are
known in the recent literature, see e.g. Pahikkala et al. (2009); Kotlowski et
al. (2011). Somewhat surprisingly, RankSVM loses to the other methods in all
the experiments other than the w = 10 experiment with the reciprocal kernel,
with the difference being especially large in the w = 100 experiment.
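
The pairwise ranking error reported in Table 1 can be illustrated with a small
sketch. It assumes that the error for one conditioning object is the fraction
of object pairs with different true relation values that the model orders
incorrectly, with predicted ties counted as half an error. The function name
and the tie handling are our assumptions for illustration, not necessarily the
exact evaluation code used in the experiments.

```python
import numpy as np

def pairwise_rank_error(y_true, y_pred):
    """Fraction of misordered pairs for one conditioning object (illustrative)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Sign of the score difference for every ordered pair (i, j).
    true_diff = np.sign(y_true[:, None] - y_true[None, :])
    pred_diff = np.sign(y_pred[:, None] - y_pred[None, :])
    relevant = true_diff > 0              # pairs where i should rank above j
    n_pairs = relevant.sum()
    if n_pairs == 0:
        return 0.0
    wrong = (pred_diff[relevant] < 0).sum()
    ties = (pred_diff[relevant] == 0).sum()   # assumption: a tie costs 1/2
    return float((wrong + 0.5 * ties) / n_pairs)
```

Averaging this quantity over all conditioning objects in the test set would give
a single conditional ranking error; a constant predictor scores 0.5 under this
convention.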

In order to have a more comprehensive view of the differences between the
RLS, RankRLS and RankSVM results, we plotted the average test performance
for the methods over the 100 repetitions of the experiments, for varying
regularization parameter choices. The results are presented in Figure 2. For
the w = 1 and w = 10 data sets all the methods share similar behavior. The
optimal ranking error can be reached for a range of smaller parameter values,
until a point is reached where the error starts increasing. However, on the
w = 100 data sets RankSVM has quite a different type of behavior². On this
data, RankSVM can reach performance as good as, or even better than, that of
RLS or RankRLS, but only for a very narrow range of parameters. Thus, for this
data prior knowledge about the suitable parameter value would be needed in
order to make RankSVM work, whereas the other approaches are more robust as
long as the parameter is set to a fairly small value.

In conclusion, we have shown in this section that highly intransitive relations 
can be modeled and successfully learned in the conditional ranking setting. 
Moreover, we have shown that when the relation graph of the training set is 



2 In order to ascertain that the difference was not simply caused by problems
in the implementation or the underlying optimization library, we checked our
results against those of the SVM^rank implementation available at
http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html

Figure 2: Rock-paper-scissors data. Ranking test error as a function of the
regularization parameter for the tested methods. Vertically: w = 1 (top),
w = 10 (middle), w = 100 (bottom). Horizontally: standard Kronecker kernel
(left), reciprocal kernel (right).

          Newsgr. 1  Newsgr. 2  Bacterial 1  Bacterial 2

c.rank    0.2562     0.2895     0.1082       0.07631
c.reg     0.3685     0.3967     0.1084       0.07762

Table 2: Overview of the measured rank loss for the 20-Newsgroups and the
bacterial species ranking tasks in the large-scale experiments, where c.rank
and c.reg are trained using the closed-form solutions.



sparse enough, existing ranking algorithms can be applied by explicitly using 
the edges of the graph as training examples. Further, the methods benefit from 
the use of the reciprocal Kronecker kernel instead of the ordinary Kronecker 
kernel. Finally, for this dataset it appears that a regression-based approach 
performs as well as the pairwise ranking methods. 
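
To make the difference between the two edge kernels concrete, the following
sketch evaluates both on a toy node kernel. This section does not restate the
kernel definitions, so the forms used here are assumptions: the ordinary
Kronecker kernel is taken to be K(v,u)K(v',u'), and the reciprocal variant is
taken to be its antisymmetrized version. The function names are hypothetical.

```python
import numpy as np

def kronecker_edge_kernel(K, e1, e2):
    """Ordinary Kronecker kernel between edges e1 = (v, v') and e2 = (u, u')."""
    (v, vp), (u, up) = e1, e2
    return K[v, u] * K[vp, up]

def reciprocal_edge_kernel(K, e1, e2):
    """Assumed reciprocal form: antisymmetrized Kronecker kernel."""
    (v, vp), (u, up) = e1, e2
    return 0.5 * (K[v, u] * K[vp, up] - K[v, up] * K[vp, u])

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
K = X @ X.T                        # linear node kernel on 4 toy nodes
a = reciprocal_edge_kernel(K, (0, 1), (2, 3))
b = reciprocal_edge_kernel(K, (1, 0), (2, 3))   # same edge, reversed direction
```

Under this form, reversing the direction of an edge flips the sign of the
kernel value, so any model built on it automatically predicts opposite scores
for (v, v') and (v', v), which is the prior knowledge about reciprocity that
the experiments above exploit.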



6.2 Document Retrieval: the 20-Newsgroups Dataset 

In the second set of experiments we aim to learn to rank newsgroup documents
according to their similarity with respect to a document the ranking is
conditioned on. We use the publicly available 20-Newsgroups data set³ for the
experiments. The data set consists of documents from 20 newsgroups, each
containing approximately 1000 documents; the document features are word
frequencies. Some of the newsgroups are considered to have similar topics, such
as the rec.sport.baseball and rec.sport.hockey newsgroups, which both contain
messages about sports. We define a three-level conditional ranking task. Given
a document, documents from the same newsgroup should be ranked the highest,
documents from similar newsgroups next, and documents from unrelated
newsgroups last. Thus, we aim to learn the conditional ranking model from an
undirected graph, and the underlying similarity relation is a symmetric
relation. The setup is similar to that of Agarwal (2006); the difference is
that we aim to learn a model for conditional ranking instead of just ranking
documents against a fixed newsgroup.
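
The three-level relation can be sketched as a small labeling function. The
numeric levels (2, 1, 0) and the rule used to decide that two newsgroups are
"similar" (sharing the same top-level hierarchy, such as rec or comp) are
assumptions made for illustration; the text above only fixes the ordering of
the three levels.

```python
def newsgroup_relevance(group_a, group_b):
    """Return 2 for the same newsgroup, 1 for similar ones, 0 otherwise."""
    if group_a == group_b:
        return 2
    # Assumption: groups under the same top-level hierarchy count as similar.
    if group_a.split(".")[0] == group_b.split(".")[0]:
        return 1
    return 0
```

Because the function is symmetric in its two arguments, the resulting relation
graph is undirected, in line with the symmetric similarity relation described
above.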

Since the training relation graph is complete, the number of edges grows
quadratically with the number of nodes. For 5000 training nodes, as considered
in one of the experiments, this already results in a graph of approximately
25 million edges. Thus, unlike in the previous rock-paper-scissors experiment,
training a ranking algorithm directly on the edges of the graph is no longer
feasible. Instead, we use the closed-form solution presented in Proposition
4.8. At the end of this section we also present experimental results for the
iterative conjugate gradient method, as this allows us to examine the effects
of early stopping and of enforcing symmetry on the prediction function.
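
A short back-of-the-envelope calculation shows why training directly on the
edges is out of reach here. The sketch assumes that edges are counted as
ordered node pairs (including self-pairs) and that an explicit edge-by-edge
kernel matrix would be stored with 8-byte floating point entries; both are our
assumptions, and the function name is hypothetical.

```python
def edge_kernel_cost(num_nodes, bytes_per_entry=8):
    """Number of edges and bytes needed for an explicit edge kernel matrix."""
    num_edges = num_nodes ** 2              # complete graph, ordered pairs
    kernel_entries = num_edges ** 2         # one entry per pair of edges
    return num_edges, kernel_entries * bytes_per_entry

edges, kernel_bytes = edge_kernel_cost(5000)
```

For 5000 nodes this gives 25 million edges and on the order of 5 petabytes for
the explicit kernel matrix, so the matrix can never be formed in practice.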

In the first two experiments, where the closed-form solution is applied, 
we assume a setting where the set of available newsgroups is not static, but 
rather over time old newsgroups may wither and die out, or new groups 
may be added. Thus, we cannot assume, when seeing new examples, that 
we have seen documents from the same newsgroup already when training 
our model. We simulate this by selecting different newsgroups for test- 
ing than for training. We form two disjoint sets of newsgroups. Set 



3 Available at: http://people.csail.mit.edu/jrennie/20Newsgroups/

Figure 3: Experimental results for the 20-Newsgroups data in the small-scale
experiment, in which all four models are learned using conjugate gradient
descent algorithms (left). Runtime comparison for different conditional ranker
training approaches when trained on a fully connected relation graph (right).

1 contains the messages from the newsgroups rec.autos, rec.sport.baseball,
comp.sys.ibm.pc.hardware and comp.windows.x, while set 2 contains the messages
from the newsgroups rec.motorcycles, rec.sport.hockey, comp.graphics,
comp.os.ms-windows.misc and comp.sys.mac.hardware. Thus, the graph formed by
set 1 consists of approximately 4000 nodes, while the graph formed by set 2
contains approximately 5000 nodes. In the first experiment, set 1 is used for
training and set 2 for testing. In the second experiment, set 2 is used for
training and set 1 for testing. The regularization parameter is selected by
using half of the training newsgroups as a holdout set against which the
parameters are tested. When training the final model, all the training data is
re-assembled.

The results for the closed-form solution experiments are presented in Table 2.
Both methods are successful in learning a conditional ranking model that
generalizes to new newsgroups which were not seen during the training phase.
The method optimizing a ranking-based loss over the pairs greatly outperforms
the one regressing the values for the relations.

Finally, we investigate whether enforcing the prior knowledge about the
underlying relation being symmetric is beneficial. In this final experiment we
use the iterative BGSM method, as it is compatible with the symmetric Kronecker
kernel, unlike the solution of Proposition 4.8. The change in setup results in
an increased computational cost, since each iteration of the BGSM method costs
as much as using Proposition 4.8 to calculate the solution. Therefore, we
simplify the previous experimental setup by sampling a training set of 1000
nodes and a test set of 500 nodes from 4 newsgroups. The task is now easier
than before, since the training and test sets have the same distribution. All
the methods are trained for 200 iterations, and the test error is plotted. We
do not apply any regularization, but rather rely on the regularizing effect of
early stopping, as discussed in Section 4.

Figure 3 contains the performance curves. Again, we see that the pairwise
ranking loss quite clearly outperforms the regression loss. Using prior
knowledge about the learned relation by enforcing symmetry leads to increased
performance, most notably for the ranking loss. The error rate curves are not
monotonically decreasing, but rather on some iterations the error may
momentarily rise sharply. This is due to the behavior of the conjugate gradient
optimization scheme, which sometimes takes steps that lead further away from
the optimal solution. The performance curves flatten out within the 200
iterations, demonstrating the feasibility of early stopping.
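
The idea of monitoring an error curve over the iterations and keeping the best
iterate can be sketched with a generic conjugate gradient loop. This is a
plain textbook solver for a small dense symmetric positive definite system,
not the Kronecker-structured solver used in the experiments; the function name,
the monitored quantity and the toy data are all assumptions made for
illustration.

```python
import numpy as np

def cg_with_early_stopping(A, b, val_error, max_iter=200):
    """Conjugate gradient for A x = b that keeps the iterate with lowest val_error."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                    # residual of the current iterate
    p = r.copy()                     # current search direction
    rs_old = r @ r
    best_x, best_err, curve = x.copy(), val_error(x), []
    for _ in range(max_iter):
        if rs_old < 1e-20:           # residual is numerically zero: converged
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        err = val_error(x)           # e.g. ranking error on held-out data
        curve.append(err)
        if err < best_err:
            best_x, best_err = x.copy(), err
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return best_x, curve

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + np.eye(6)              # toy symmetric positive definite system
b = rng.normal(size=6)
# Stand-in monitor: residual norm instead of a real held-out ranking error.
x_best, curve = cg_with_early_stopping(A, b, lambda x: np.linalg.norm(A @ x - b))
```

Keeping the best iterate rather than the last one is one simple way to cope
with error curves that are not monotonically decreasing.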

In conclusion, we have demonstrated various characteristics of our approach 
in the newsgroups experiments. We showed that the introduced methods scale 
to training graphs that consist of tens of millions of edges, each having a high- 
dimensional feature representation. We also showed the generality of our ap- 
proach, as it is possible to learn conditional ranking models even when the test 
newsgroups are not represented in the training data, as long as data from sim- 
ilar newsgroups is available. Unlike the earlier experiments on the rock-paper- 
scissors data, the pairwise loss yields a dramatic improvement in performance 
compared to a regression-based loss. Finally, enforcing prior knowledge about 
the type of the underlying relation with kernels was shown to be advantageous. 



6.3 Microbiology: Ranking Bacterial Species 

We also illustrate the potential of conditional ranking for multi-class classifica- 
tion problems with a huge number of classes. For such problems it often happens 
that many classes are not represented in the training dataset, simply because no 
observations of these classes are known at the moment that the training dataset 
is constructed and the predictive model is learned. It goes without saying that
existing multi-class classification methods cannot make any correct predictions
for observations of these classes, which might occur in the test set.

However, by reformulating the problem as a conditional ranking task, one is 
still capable of extracting some useful information for these classes during the 
test phase. The conditional ranking algorithms that we introduced in this article 
have the ability to condition a ranking on a target object that is unknown during 
the training phase. In a multi-class classification setting, we can condition the 
ranking on objects of classes that are not present in the training data. To this 
end, we consider bacterial species identification in microbiology. 

In this application domain, one normally defines a multi-class classification
problem with a huge number of classes as identifying bacterial species, given
their fatty acid methyl ester (FAME) profile as input for the model
(Slabbinck et al., 2010; MacLeod et al., 2010). Here we reformulate this task
as a conditional ranking task. For a given target FAME profile of a bacterium
that is not necessarily present in the training dataset, the algorithm should
rank all remaining FAME profiles of the same species higher than FAME profiles
of other species. For the most challenging scenario, none of these FAME
profiles appears in the training dataset.

As a result, the underlying relational graph consists of two types of edges,
those connecting FAME profiles of identical species and those connecting FAME
profiles of different species. When conditioned on a single node, this setting
realizes a bipartite ranking problem, based on an underlying symmetric relation.
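
The two edge types can be encoded as a binary relation matrix derived from the
species labels. The sketch below is a minimal illustration; the 0/1 encoding
and the function name are assumptions, and the toy labels are hypothetical.

```python
import numpy as np

def same_species_relation(species):
    """Y[i, j] = 1 if profiles i and j belong to the same species, else 0."""
    species = np.asarray(species)
    return (species[:, None] == species[None, :]).astype(int)

# Three toy FAME profiles with hypothetical species labels.
Y = same_species_relation(["a", "b", "a"])
```

Each row of Y is the bipartite ground truth for one conditioning profile, and
the matrix is symmetric by construction.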

The data we used is described in more detail in Slabbinck et al. (2010). Its
original version consists of 955 FAME profiles, divided into 73 different
classes that represent different bacterial species. A training set and two
separate test sets were formed as follows. The data points belonging to the
largest two classes were randomly divided between the training set and test
set 1. Of the remaining smaller classes, 26 were included entirely in the
training set, and 27 were combined together to form test set 2. The difference
between the test sets was thus that FAME profiles from classes contained in
test set 1 appear also in the training set, while this is not the case for test
set 2. The resulting set sizes were as follows. Training set: 473 nodes, test
set 1: 308 nodes and test set 2: 174 nodes. Since the graphs are fully
connected, the number of edges grows quadratically with respect to the number
of nodes. The regularization parameter is chosen on a separate holdout set.

Due to the large number of edges, we train the rankers using the closed-form
solution. We also ran an experiment where we tested the effects of using the
symmetric Kronecker kernel, together with the iterative training algorithm. In
this experiment, using the symmetric Kronecker kernel leads to performance very
similar to that obtained without it; therefore, we do not present these results
separately.

Table 2 summarizes the resulting rank loss for the two different test sets,
obtained after training the conditional regression and ranking methods using the
closed-form solutions. Both methods are capable of training accurate ranking
models that can distinguish bacteria of the same species as the conditioning
data point from bacteria of different species. Furthermore, comparing the
results for test sets 1 and 2, we note that for this problem it is not necessary
to have bacteria from the same species present in both the test and training
sets for the models to generalize. In fact, the test error on test set 2 is
lower than the error on test set 1. The ranking-based loss function leads to a
slightly better test performance than regression.

6.4 Bioinformatics: Functional Ranking of Enzymes

As a last application we consider the problem of ranking a database of enzymes 
according to their catalytic similarity to a query protein. This catalytic similar- 
ity, which serves as the relation of interest, represents the relationship between 
enzymes w.r.t. their biological function. For newly discovered enzymes, this 
catalytic similarity is usually not known, so one can think of trying to predict it 
using machine learning algorithms and kernels that describe the structure-based 
or sequence-based similarity between enzymes. The Enzyme Commission (EC) 
functional classification is commonly used to subdivide enzymes into functional 
classes. EC numbers adopt a four-label hierarchical structure, representing dif- 
ferent levels of catalytic detail. 

We base the conditional rankings on the EC numbers of the enzymes, in- 
formation which we assume to be available for the training data, but not at 
prediction time. This ground truth ranking can be deduced from the catalytic 
similarity (i.e. ground truth similarity) between the query and all database 
enzymes. To this end, we count the number of successive correspondences from 
left to right, starting from the first digit in the EC label of the query and the 
database enzymes, and stopping as soon as the first mismatch occurs. For ex- 
ample, an enzyme query with number EC 2.4.2.23 has a similarity of two with a 
database enzyme labeled EC 2.4.99.12, since both enzymes belong to the fam- 
ily of glycosyl transferases. The same query manifests a similarity of one with 
an enzyme labeled EC 2.8.2.23. Both enzymes are transferases in this case, but 
they show no further similarity in the chemistry of the reactions to be catalyzed. 
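
The counting rule for the ground truth similarity can be written down directly.
The sketch below is a minimal illustration; the function name is hypothetical,
and EC numbers are assumed to be given as dot-separated strings without the
"EC" prefix.

```python
def ec_similarity(ec_a, ec_b):
    """Number of leading EC fields that match before the first mismatch."""
    count = 0
    for field_a, field_b in zip(ec_a.split("."), ec_b.split(".")):
        if field_a != field_b:
            break                    # stop at the first mismatch
        count += 1
    return count
```

With this rule, the query 2.4.2.23 has similarity two with 2.4.99.12 and
similarity one with 2.8.2.23, matching the examples above; later matching
fields (here the trailing 2.23) do not count once a mismatch has occurred.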

Our models were built and tested using a dataset of 1730 enzymes with
known protein structures. All the enzyme structures had a resolution of at
least 2.5 Å, they had a binding site volume between 350 and 3500 Å³, and they
were fully EC annotated. For evaluation purposes our database contained at
least 20 observations for every EC number, leading to a total of 21 different
EC numbers comprising members of all 6 top level codes. A heat map of the
catalytic similarity of the enzymes is given in Figure 4. This catalytic
similarity will be our relation of interest, constituting the output of the
algorithm. As input we consider five state-of-the-art kernel matrices for
enzymes, denoted cb (CavBase similarity), mcs (maximum common subgraph), lpcs
(labeled point cloud superposition), fp (fingerprints) and wfp (weighted
fingerprints). More details about the generation of these kernel matrices can
be found in Stock et al.

               cb      fp      wfp     mcs     lpcs

unsupervised   0.0938  0.1185  0.1533  0.1077  0.1123
c.reg          0.0052  0.0050  0.0019  0.0054  0.0073
c.rank         0.0049  0.0050  0.0019  0.0056  0.0048

Table 3: A summary of the results obtained for the enzyme ranking problem.

The dataset was randomized and split in four equal parts. Each part was
withheld as a test set while the other three parts of the dataset were used
for training and model selection. This process was repeated for each part so
that every instance was used for training and testing (thus, four-fold outer
cross-validation). In addition, a 10-fold inner cross-validation loop was
implemented for estimating the optimal regularization parameter λ, as
recommended by Varma and Simon (2006). The value of the hyperparameter was
selected from a grid containing all the powers of 10 from $10^{-4}$ to
$10^{5}$. The final model was trained using the whole training set and the
median of the best hyperparameter values over the ten folds.
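
The inner selection loop can be sketched as follows. This is only an
illustration under several assumptions: kernel ridge regression with squared
error stands in for the actual conditional ranking learners, the median is
taken in log10 space so that it remains meaningful on a logarithmic grid, the
outer four-fold loop is omitted, and the function name and toy data are
hypothetical.

```python
import numpy as np

def select_lambda(K, y, grid, n_folds=10, seed=0):
    """Pick the best grid value in each inner fold, then return the median."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    best_per_fold = []
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        K_tt = K[np.ix_(train, train)]
        K_vt = K[np.ix_(val, train)]
        errors = []
        for lam in grid:
            # Kernel ridge regression fit on the inner training part.
            alpha = np.linalg.solve(K_tt + lam * np.eye(len(train)), y[train])
            errors.append(np.mean((K_vt @ alpha - y[val]) ** 2))
        best_per_fold.append(grid[int(np.argmin(errors))])
    # Assumption: median in log10 space (geometric mean of the middle values).
    return float(10.0 ** np.median(np.log10(best_per_fold)))

grid = [10.0 ** k for k in range(-4, 6)]     # 10^-4, ..., 10^5

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=40)
K = X @ X.T                                  # toy linear kernel
lam = select_lambda(K, y, grid)
```

In the full protocol this selection would be repeated inside each of the four
outer folds, using only the three training parts of that fold.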

We benchmark our algorithms against an unsupervised procedure that is 
commonly used in bioinformatics for retrieval of enzymes. Given a specific 
enzyme query and one of the above similarity measures, a ranking is constructed 
by computing the similarity between the query and all other enzymes in the 
database. Enzymes having a high similarity to the query appear on top of the 
ranking, those exhibiting a low similarity end up at the bottom. More formally, 
let us represent the similarity between two enzymes by $K : V^2 \to \mathbb{R}$,
where $V$ represents the set of all potential enzymes. Given the similarities
$K(v, v')$ and $K(v, v'')$, we compose the ranking of $v'$ and $v''$ conditioned
on the query $v$ as:

$$v' \succ_v v'' \Leftrightarrow K(v, v') > K(v, v''). \qquad (40)$$

This approach follows in principle the same methodology as a nearest neighbor
classifier, but the output of the algorithm is a ranking rather than a class
label.
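
The unsupervised baseline amounts to sorting the database by kernel value with
respect to the query. The sketch below is a minimal illustration on a toy
kernel matrix; the function name and the numbers are hypothetical.

```python
import numpy as np

def unsupervised_ranking(K, query):
    """Rank all other objects by decreasing similarity K[query, .]."""
    candidates = np.array([i for i in range(K.shape[0]) if i != query])
    order = np.argsort(-K[query, candidates], kind="stable")
    return candidates[order].tolist()

# Toy similarity matrix for three enzymes.
K = np.array([[1.0, 0.2, 0.9],
              [0.2, 1.0, 0.5],
              [0.9, 0.5, 1.0]])
ranking = unsupervised_ranking(K, query=0)
```

No training takes place here: the quality of the ranking depends entirely on
how well the similarity measure reflects the relation of interest.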

Table 3 gives a global summary of the results obtained for the different
ranking approaches. All models score relatively well. One can observe that
supervised ranking models outperform unsupervised ranking for all five kernels.
Three important reasons can be put forward for explaining the improvement in
performance. First of all, the traditional benefit of supervised learning plays
an important role. One can expect that supervised ranking methods outperform
unsupervised ranking methods, because they take ground-truth rankings into
account during the training phase to steer towards retrieval of enzymes with
a similar EC number. Conversely, unsupervised methods solely rely on the
characterization of a meaningful similarity measure between enzymes, while
ignoring EC numbers.

Figure 4: Heatmaps of the values used for ranking the database during one fold
in the testing phase. Each row of the heat map corresponds to one query. The
corresponding ground truth is given in the lower right picture. The supervised
model is trained by optimizing the pairwise ranking loss.

Second, we also argue that supervised ranking methods have the ability to
preserve the hierarchical structure of EC numbers in their predicted rankings.
Figure 4 supports this claim. It summarizes the values used for ranking one
fold of the test set obtained by the different models as well as the
corresponding ground truth. So, for supervised ranking it visualizes the values
h(v, v'), and for unsupervised ranking it visualizes K(v, v'). Each row of the
heatmap corresponds to one query. For the supervised models one notices a much
better correspondence with the ground truth. Furthermore, the different levels
of catalytic similarity can be better distinguished.

A third reason for improvement by the supervised ranking method can be
found in the exploitation of dependencies between different similarity values.
Roughly speaking, if one is interested in the similarity between enzymes v and
u, one can try to compute the similarity in a direct way, or derive it from the
similarity with a third enzyme z. In the context of inferring protein-protein
interaction and signal transduction networks, both methods are known as the
direct and indirect approach, respectively (Vert et al., 2007; Geurts et al.).
We argue that unsupervised ranking boils down to a direct approach, while
supervised ranking should be interpreted as indirect. Especially when the kernel
matrix contains noisy values, one can expect that the indirect approach allows
for detecting the backbone entries and correcting the noisy ones.

The results for the two supervised conditional ranking approaches are in
many cases similar, with both models having the same predictive performance on
two of the kernels (fp and wfp). For one of the kernels (lpcs) ranking loss
gives much better performance than the regression one, for another kernel (cb) 
ranking loss has a slight advantage, and in the remaining experiment (mcs) the 
regression approach performs slightly better. The correct choice of node-level 
kernel proves to be much more important than the choice of the loss function, as 
the supervised models trained using the wfp kernel clearly outperform all other 
approaches. 



6.5 Runtime Performance 

In the runtime experiment we compare the computational efficiency of the
conditional ranker training approaches considered in Section 2. We compare an
off-the-shelf ranking algorithm trained directly on the edges (RankRLS),
conjugate gradient training with early stopping, and the closed-form solution.
We did not consider the RankSVM baseline method, as kernel RankSVM solvers
have been previously shown to have worse scalability than kernel RankRLS
(Pahikkala et al., 2009). The methods are run on subsets of the Reuters data,
ranging from 10 nodes to 1000 nodes. Edges between all the nodes are included.

The results are presented in Figure 3 (right). First, let us consider the
scaling behavior when training a standard ranking algorithm directly on the
graph edges (RankRLS). The kernel RankRLS solver has cubic time complexity;
training it on all the edges in the fully connected training graph thus results
in $O(p^6)$ time complexity, where $p$ is the number of nodes. It can be
observed that in practice the approach does not scale beyond tens of nodes
(thousands of edges), meaning that it cannot be applied beyond small toy
problems. Similar behavior can be expected from any other kernel solver that
would attempt to explicitly compute the kernel matrix. Clearly, one needs to
apply techniques such as the Kronecker product shortcuts considered in this
work to make learning practical.
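
The kind of shortcut referred to here can be illustrated with the standard
identity for Kronecker products, (A ⊗ B) vec(X) = vec(A X Bᵀ) for row-major
vectorization. The sketch below is a generic demonstration of that identity,
not the specific algorithm of this paper; the function name and the toy
matrices are assumptions.

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without forming the Kronecker product."""
    m, n = A.shape[1], B.shape[1]
    X = x.reshape(m, n)              # row-major reshape matches np.kron ordering
    return (A @ X @ B.T).ravel()

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(4, 4))
x = rng.normal(size=12)
fast = kron_matvec(A, B, x)
slow = np.kron(A, B) @ x             # explicit product, for comparison only
```

For two p-by-p node kernel matrices, the explicit product has p^4 entries,
whereas the shortcut needs only two p-by-p matrix multiplications, i.e. O(p^3)
time and O(p^2) memory.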

In contrast, the iterative training algorithm (Early stopping CG) and the
closed-form solution allow efficient scaling to graphs with thousands of nodes,
and hence millions of edges. While the iterative training method and the
closed-form solution share the same $O(p^3)$ asymptotic behavior, the
closed-form solution allows an order of magnitude faster training, making it
the method of choice whenever applicable. The results further demonstrate our
claims about the scalability of the proposed algorithms to large dense graphs,
as even with a non-optimized implementation in a high-level programming
language (Python), one can handle training a kernel solver on a million edges
in a matter of minutes.



7 Conclusion 

We presented in this article a general framework for conditional ranking from
various types of relational data, where rankings can be conditioned on unseen
objects. Symmetric or reciprocal relations can be treated as two important
special cases. We proposed efficient least-squares algorithms for optimizing
regression and ranking-based loss functions. Experimental results on synthetic
and two real-world datasets confirm that the conditional ranking problem can
be solved efficiently and accurately, and that optimizing a ranking-based loss
can be beneficial, instead of aiming to predict the underlying relations
directly. Moreover, we also showed empirically that incorporating domain
knowledge about the underlying relations can boost the predictive performance.

Briefly summarized, we have discussed the following three approaches for 
solving conditional ranking problems: 

• off-the-shelf ranking algorithms can be used when they can be compu- 
tationally afforded, i.e., when the number of edges in the training set is 
small; 

• the above-presented approach based on the conjugate gradient method 
with early stopping and taking advantage of the special matrix structures 
is recommended when using off-the-shelf methods becomes intractable; 

• the closed-form solution presented in Proposition 4.8 is recommended if its
requirements are fulfilled, since its computational complexity is equivalent
to that of a single iteration of the conjugate gradient method.

For a given application domain, the domain expert has to decide which of these
three cases is most appropriate.